In this example, we are going to use vLLM to host an OpenAI-compatible InternVL3 8B API on Beam.

View the Code

See the code for this example on GitHub.

Introduction to vLLM

vLLM is a high-performance, easy-to-use library for LLM inference. It can be up to 24 times faster than Hugging Face’s Transformers library, and it lets you easily set up an OpenAI-compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA, which means you can follow our LoRA guide and host the resulting model with vLLM.

The key to vLLM’s performance is PagedAttention. In LLMs, input tokens produce attention key and value tensors, which are typically stored in GPU memory. PagedAttention stores these continuous keys and values in non-contiguous memory by partitioning them into blocks that are fetched on a need-to-use basis.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
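
To make the block-table analogy concrete, here is a toy Python sketch (purely illustrative, not vLLM internals): each sequence’s logical KV blocks are mapped to whatever physical blocks happen to be free, so sequences can grow without needing contiguous GPU memory.

# Toy illustration of the PagedAttention block table (not vLLM internals).
BLOCK_SIZE = 16  # tokens per block ("page")

free_physical_blocks = list(range(100))  # pool of KV-cache blocks on the GPU
block_tables = {}                        # sequence id -> list of physical block ids

def append_token(seq_id: str, token_index: int) -> int:
    """Return the physical block holding this token's KV entries, allocating a
    new (possibly non-contiguous) block whenever a logical block fills up."""
    table = block_tables.setdefault(seq_id, [])
    logical_block = token_index // BLOCK_SIZE
    if logical_block == len(table):  # logical block not allocated yet
        table.append(free_physical_blocks.pop())
    return table[logical_block]

# Two sequences interleave allocations, so their physical blocks end up
# scattered; the block table is what makes that harmless.
for t in range(40):
    append_token("seq-a", t)
    append_token("seq-b", t)
print(block_tables["seq-a"])  # prints [99, 97, 95]: non-contiguous physical blocks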

Hosting an OpenAI-Compatible Chat API with vLLM

With vLLM, we can host a fully functional chat API and interact with it using existing OpenAI-compatible SDKs. You could build this functionality yourself, but vLLM provides a great out-of-the-box solution.

Initial Setup

To get started with vLLM on Beam, we can use the VLLM class from the Beam SDK. This class supports all of the flags and arguments of the vLLM command-line tool, passed in via VLLMArgs.

Setup Compute Environment

Let’s take a look at the code required to deploy the OpenGVLab/InternVL3-8B-AWQ model with an efficient configuration. We start by defining the environment and the necessary arguments for our vLLM server.

models.py
from beam.integrations import VLLM, VLLMArgs

MODEL_ID = "OpenGVLab/InternVL3-8B-AWQ"

vllm_server = VLLM(
    name=MODEL_ID.split("/")[-1],
    cpu=4,
    memory="16Gi",
    gpu="A10G",
    gpu_count=1,
    workers=1,
    vllm_args=VLLMArgs(
        model=MODEL_ID,
        served_model_name=[MODEL_ID],
        trust_remote_code=True,
        max_model_len=4096,
        gpu_memory_utilization=0.90,
        limit_mm_per_prompt={"image": 2},
        quantization="awq",
        max_num_batched_tokens=8192,
    )
)

Key Configuration Parameters:

  • name: A descriptive name for your Beam application.
  • cpu: Number of CPU cores allocated (e.g., 4).
  • memory: Amount of memory allocated (e.g., “16Gi”).
  • gpu: Type of GPU to use (e.g., “A10G”).
  • gpu_count: Number of GPUs (e.g., 1).
  • workers: Number of worker processes for vLLM (e.g., 1).
  • vllm_args: Arguments passed directly to the vLLM engine:
    • model: The Hugging Face model identifier.
    • served_model_name: Name under which the model is served.
    • trust_remote_code: Allows the model to execute custom code if required.
    • max_model_len: Maximum token sequence length for the model.
    • gpu_memory_utilization: Target GPU memory utilization (e.g., 0.90 for 90%).
    • limit_mm_per_prompt: Caps the number of multi-modal inputs per prompt (here, at most 2 images).
    • quantization: The quantization method of the checkpoint (e.g., “awq”). The model used here is already AWQ-quantized, so this flag tells vLLM to load it with the matching AWQ kernels.
    • max_num_batched_tokens: Maximum number of tokens processed in a single batch during continuous batching (e.g., 8192).

Equivalent vLLM Command Line (for reference):

The VLLM integration in Beam simplifies deployment. If you were to run a similar configuration using the vllm serve command-line tool directly, some of the corresponding arguments would be:

vllm serve OpenGVLab/InternVL3-8B-AWQ \
    --trust-remote-code \
    --max-model-len 4096 \
    --limit-mm-per-prompt image=2 \
    --quantization awq \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.90
# Note: Parameters like cpu, memory, gpu_count, and workers are managed by Beam's infrastructure.

Deploying the API

To deploy our model, we can run the following command:

beam deploy models.py:vllm_server

The output will look like this:

=> Building image
=> Using cached image
=> Syncing files
Reading .beamignore file
Collecting files from /Users/minzi/Dev/beam/ex-repo/vllm
Added /Users/minzi/Dev/beam/ex-repo/vllm/models.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/tool_chat_template_mistral.jinja
Added /Users/minzi/Dev/beam/ex-repo/vllm/README.md
Added /Users/minzi/Dev/beam/ex-repo/vllm/chat.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/inference.py
Collected object is 14.46 KB
=> Files already synced
=> Deploying
=> Deployed 🎉
=> Invocation details
curl -X POST 'https://internvl-15c4487-v4.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_TOKEN' \
-d '{}'
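
Once the deployment is live, a quick way to confirm the server responds is to list the served models through vLLM’s OpenAI-compatible /v1/models endpoint. A minimal sketch using the openai client (the URL and token below are placeholders for your own deployment details):

from openai import OpenAI

# Placeholders: substitute your own deployment URL and Beam auth token.
APP_URL = "https://internvl-15c4487-v4.app.beam.cloud"
BEAM_TOKEN = "YOUR_TOKEN"

client = OpenAI(base_url=f"{APP_URL}/v1", api_key=BEAM_TOKEN)

# vLLM's OpenAI-compatible server reports the models it is serving.
for model in client.models.list():
    print(model.id)  # e.g. OpenGVLab/InternVL3-8B-AWQ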

Using the API

Pre-requisites

Once your app is deployed, you can interact with it using the OpenAI Python client.

To get started, you can clone the example repository and run the chat.py script.

Make sure you have the openai library installed locally, since that is how we interact with the deployed API.

git clone https://github.com/beam-cloud/examples.git
cd examples/vllm
pip install openai
python chat.py

Starting a Dialogue

You will be greeted with a prompt to enter the URL of your deployed app.

Once you enter the URL, the container will initialize on Beam and you will be able to interact with the model.

Welcome to the CLI Chat Application!

Type 'quit' to exit the conversation.

Enter the app URL: https://internvl-15c4487-v4.app.beam.cloud

Model OpenGVLab/InternVL3-8B-AWQ is ready

Question: What is in this image?

Image link (press enter to skip): https://upload.wikimedia.org/wikipedia/commons/7/74/White_domesticated_duck,_stretching.jpg

Assistant:  The image you've shared is of a white duck standing on a grassy field. The duck, with its distinctive orange beak and feet, is facing to the left.
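
The chat.py script uses the OpenAI Python client under the hood. If you prefer to call the API directly, here is a minimal sketch of the same multi-modal request (the URL, token, and image link are the example values from above; substitute your own):

from openai import OpenAI

APP_URL = "https://internvl-15c4487-v4.app.beam.cloud"  # your deployment URL
BEAM_TOKEN = "YOUR_TOKEN"                               # your Beam auth token
MODEL_ID = "OpenGVLab/InternVL3-8B-AWQ"

client = OpenAI(base_url=f"{APP_URL}/v1", api_key=BEAM_TOKEN)

# Multi-modal chat request: text plus an image URL, in the OpenAI message format.
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/7/74/White_domesticated_duck,_stretching.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)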

To host other models, you can simply change the arguments you pass into the VLLM class.

from beam.integrations import VLLM, VLLMArgs

YI_CODER_CHAT = "01-ai/Yi-Coder-9B-Chat"

yicoder_chat = VLLM(
    name=YI_CODER_CHAT.split("/")[-1],
    cpu=8,
    memory="16Gi",
    gpu="A100-40",
    vllm_args=VLLMArgs(
        model=YI_CODER_CHAT,
        served_model_name=[YI_CODER_CHAT],
        task="chat",
        trust_remote_code=True,
        max_model_len=8096,
    ),
)
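
Assuming this object also lives in models.py, you would deploy it the same way as before with beam deploy models.py:yicoder_chat.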