In this example, we will use vLLM to host an OpenAI-compatible API for InternVL2.5 8B on Beam.

View the Code

See the code for this example on GitHub.

Introduction to vLLM

vLLM is a high-performance, easy-to-use library for LLM inference. It can deliver up to 24 times higher throughput than Hugging Face’s Transformers library, and it lets you easily set up an OpenAI-compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA, so you can follow our LoRA guide and host the resulting model using vLLM.

The key to vLLM’s performance is PagedAttention. In LLMs, input tokens produce attention key and value tensors, which are typically stored in GPU memory. PagedAttention stores these logically contiguous keys and values in non-contiguous memory by partitioning them into blocks that are fetched as needed.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
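The block-table idea in the quote above can be sketched in a few lines of Python. This is a toy illustration only, not vLLM's actual implementation: the class names, the block size of 4 tokens, and the free-list allocator are all hypothetical choices made for this sketch.

```python
# Toy illustration of a block table; not vLLM's actual implementation.
BLOCK_SIZE = 4  # tokens per block (hypothetical value for this sketch)


class ToyBlockAllocator:
    """Hands out physical block ids from a shared free list."""

    def __init__(self, num_physical_blocks):
        # Reversed so that pop() yields block 0 first, then 1, 2, ...
        self.free_blocks = list(range(num_physical_blocks - 1, -1, -1))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free physical blocks")
        return self.free_blocks.pop()


class ToySequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index):
        """Return (physical_block_id, offset) for a token position."""
        logical_block, offset = divmod(token_index, BLOCK_SIZE)
        return self.block_table[logical_block], offset


allocator = ToyBlockAllocator(num_physical_blocks=8)
seq_a, seq_b = ToySequence(allocator), ToySequence(allocator)

# Interleave generation so the two sequences take blocks alternately.
for _ in range(6):
    seq_a.append_token()
    seq_b.append_token()

print(seq_a.block_table)  # physical blocks backing seq_a's logical blocks 0 and 1
print(seq_b.block_table)
print(seq_a.locate(5))    # where token 5 of seq_a lives
```

Because both sequences draw from the same pool, each sequence's logically adjacent blocks end up in non-adjacent physical blocks, and the block table is what maps one to the other.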

Hosting an OpenAI-Compatible Chat API with vLLM

With vLLM, we can host a fully functional chat API and interact with it through existing SDKs. You could build this functionality yourself, but vLLM provides a great out-of-the-box solution.

Initial Setup

To get started with vLLM on Beam, we can use the VLLM class from the Beam SDK. This class accepts all of the flags and options of the vLLM command-line tool as arguments.

Setup Compute Environment

Let’s take a look at the code required to deploy the InternVL2.5 8B model from OpenGVLab. Just like a normal Beam application, we start by defining the environment. For this model, we will use 8 CPU cores, 32 GiB of memory, and two A10G GPUs. With those details set, we can focus on which arguments we need to pass to our vLLM server.

models.py
from beam.integrations import VLLM, VLLMArgs

INTERNVL2_5 = "OpenGVLab/InternVL2_5-8B"

internvl = VLLM(
    name=INTERNVL2_5.split("/")[-1],
    cpu=8,
    memory="32Gi",
    gpu="A10G",
    gpu_count=2,
    vllm_args=VLLMArgs(
        model=INTERNVL2_5,
        served_model_name=[INTERNVL2_5],
        trust_remote_code=True,
        max_model_len=4096,
        gpu_memory_utilization=0.95,
        limit_mm_per_prompt={"image": 2},
    )
)

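The first argument we need to set is model, the Hugging Face ID of the model to serve; served_model_name is the name clients pass in the model field of their requests. For this model, we set trust_remote_code to True because its Hugging Face repository includes custom code that must be allowed to load. We set max_model_len to 4096, the maximum context length in tokens (prompt plus generated output) for a single request, and limit_mm_per_prompt to {"image": 2}, which caps the number of images in a single request at two. Finally, gpu_memory_utilization=0.95 lets vLLM use up to 95% of each GPU's memory.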

A roughly equivalent invocation of the vLLM command-line tool would be:

vllm serve OpenGVLab/InternVL2_5-8B --trust-remote-code \
--max-model-len 4096 --limit-mm-per-prompt image=2
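To get a feel for why max_model_len affects GPU memory, here is a back-of-the-envelope estimate of the space taken by cached keys and values. It is a rough sketch only: the layer count, number of key/value heads, head dimension, and 16-bit dtype below are illustrative assumptions, not values read from the model's configuration, and vLLM's real memory accounting involves more than this.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes needed to store the keys and values of one token across all layers."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative architecture values (assumptions, not read from the model config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_MODEL_LEN = 4096  # matches max_model_len in models.py

per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
per_sequence = per_token * MAX_MODEL_LEN

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 1024**2:.0f} MiB for one full-length sequence")
```

Under these assumed values, doubling max_model_len doubles the worst-case cache size per sequence, which is one reason to keep it no larger than your requests need.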

Deploying the API

To deploy our model, we can run the following command:

beam deploy models.py:internvl

The output will look like this:

=> Building image
=> Using cached image
=> Syncing files
Reading .beamignore file
Collecting files from /Users/minzi/Dev/beam/ex-repo/vllm
Added /Users/minzi/Dev/beam/ex-repo/vllm/models.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/tool_chat_template_mistral.jinja
Added /Users/minzi/Dev/beam/ex-repo/vllm/README.md
Added /Users/minzi/Dev/beam/ex-repo/vllm/chat.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/inference.py
Collected object is 14.46 KB
=> Files already synced
=> Deploying
=> Deployed 🎉
=> Invocation details
curl -X POST 'https://internvl-15c4487-v4.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_TOKEN' \
-d '{}'

Using the API

Pre-requisites

Once your function is deployed, you can interact with it using the OpenAI Python client.

To get started, you can clone the example repository and run the chat.py script.

Make sure you have the openai library installed locally, since that is how we interact with the deployed API.

git clone https://github.com/beam-cloud/examples.git
cd examples/vllm
pip install openai
python chat.py  

Starting a Dialogue

You will be greeted with a prompt to enter the URL of your deployed function.

Once you enter the URL, the container will initialize on Beam and you will be able to interact with the model.

Welcome to the CLI Chat Application!

Type 'quit' to exit the conversation.

Enter the app URL: https://internvl-instruct-15c4487-v3.app.beam.cloud

Model OpenGVLab/InternVL2_5-8B is ready

Question: What is in this image?

Image link (press enter to skip): https://upload.wikimedia.org/wikipedia/commons/7/74/White_domesticated_duck,_stretching.jpg

Assistant:  The image you've shared is of a white duck standing on a grassy field. The duck, with its distinctive orange beak and feet, is facing to the left. 
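If you would rather send a request yourself than use chat.py, the exchange above corresponds roughly to an OpenAI-style chat completion request with one text part and one image part. The sketch below uses only the Python standard library. The helper names are hypothetical, and the /v1/chat/completions path appended to the app URL is an assumption based on vLLM's OpenAI-compatible server; check your deployment's invocation details before relying on it.

```python
import json
import urllib.request


def build_chat_request(model, question, image_urls=(), max_images=2):
    """Build an OpenAI-style chat completion payload with optional images."""
    if len(image_urls) > max_images:
        # Client-side mirror of limit_mm_per_prompt={"image": 2} in models.py.
        raise ValueError(f"at most {max_images} images per prompt")
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def send_chat_request(app_url, token, payload):
    """POST the payload; the /v1/chat/completions path is an assumption."""
    request = urllib.request.Request(
        app_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "OpenGVLab/InternVL2_5-8B",
    "What is in this image?",
    ["https://upload.wikimedia.org/wikipedia/commons/7/74/White_domesticated_duck,_stretching.jpg"],
)
print(json.dumps(payload, indent=2))

# To actually call a deployment (not run here), something like:
# answer = send_chat_request("https://<your-app>.app.beam.cloud", "YOUR_TOKEN", payload)
```

Only the payload is built and printed here; the network call is left commented out so the snippet runs without a deployed app or a token.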

To host other models, you can simply change the arguments you pass into the VLLM class.