Run an OpenAI-Compatible vLLM Server
In this example, we are going to use vLLM to host an OpenAI-compatible InternVL3 8B API on Beam.
View the Code
See the code for this example on GitHub.
Introduction to vLLM
vLLM is a high-performance, easy-to-use library for LLM inference. It can be up to 24 times faster than Hugging Face’s Transformers library, and it lets you easily set up an OpenAI-compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA, which means you can follow our LoRA guide and then host the resulting model with vLLM.
The key to vLLM’s performance is PagedAttention. In LLMs, input tokens produce attention key and value tensors, which are typically stored in GPU memory. PagedAttention partitions these keys and values into blocks that are stored in non-contiguous memory and fetched on a need-to-use basis.
Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
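To make the page-table analogy concrete, here is a toy Python sketch of the block-table idea. This is purely illustrative and not vLLM’s actual implementation; the block size and mapping are made up.

```python
# Toy illustration of PagedAttention's block-table idea (not vLLM internals).
# Tokens are grouped into fixed-size logical blocks; each logical block maps to
# an arbitrary physical block, so physical storage need not be contiguous.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative)

# Block table for one sequence: logical block index -> physical block index.
block_table = {0: 7, 1: 2, 2: 9}  # physical blocks are scattered, like OS pages

def locate_token_kv(token_pos: int) -> tuple[int, int]:
    """Return the (physical_block, offset) holding the KV entries for a token."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], offset

print(locate_token_kv(6))  # token 6 lives in physical block 2 at offset 2
```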
Hosting an OpenAI-Compatible Chat API with vLLM
With vLLM, we can host a fully functional chat API that we can interact with using existing OpenAI SDKs. You could build this functionality yourself, but vLLM provides a great out-of-the-box solution.
Initial Setup
To get started with vLLM on Beam, we can use the VLLM class from the Beam SDK. This class supports all of the flags and arguments of the vLLM command-line tool as arguments.
Setup Compute Environment
Let’s take a look at the code required to deploy the OpenGVLab/InternVL3-8B-AWQ model with an efficient configuration. We start by defining the environment and the necessary arguments for our vLLM server; a sketch of the app definition follows the parameter list below.
Key Configuration Parameters:
- name: A descriptive name for your Beam application.
- cpu: Number of CPU cores allocated (e.g., 4).
- memory: Amount of memory allocated (e.g., “16Gi”).
- gpu: Type of GPU to use (e.g., “A10G”).
- gpu_count: Number of GPUs (e.g., 1).
- workers: Number of worker processes for vLLM (e.g., 1).
- vllm_args: Arguments passed directly to the vLLM engine:
  - model: The Hugging Face model identifier.
  - served_model_name: Name under which the model is served.
  - trust_remote_code: Allows the model to execute custom code if required.
  - max_model_len: Maximum token sequence length for the model.
  - gpu_memory_utilization: Target GPU memory utilization (e.g., 0.90 for 90%).
  - limit_mm_per_prompt: (If applicable) Limits for multi-modal inputs.
  - quantization: Enables model quantization (e.g., “awq”). This is often beneficial even if the model name suggests it’s pre-quantized, as vLLM handles the specifics.
  - max_num_batched_tokens: Sets the capacity for tokens in a batch for dynamic batching (e.g., 8192).
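Below is a minimal sketch of what this app definition might look like. It assumes the VLLM class (and a VLLMArgs container for the engine arguments) is importable from the Beam SDK as in the example repository; the import path, field spellings, and the specific values such as max_model_len are illustrative, so check the repository for the exact code.

```python
# Sketch of the Beam app definition for this example (values are illustrative).
# VLLM and VLLMArgs are assumed imports; verify them against the example repo.
from beam.integrations import VLLM, VLLMArgs

MODEL_ID = "OpenGVLab/InternVL3-8B-AWQ"

internvl = VLLM(
    name="internvl3-8b",   # descriptive name for the Beam application
    cpu=4,                 # CPU cores
    memory="16Gi",         # RAM
    gpu="A10G",            # GPU type
    gpu_count=1,           # number of GPUs
    workers=1,             # vLLM worker processes
    vllm_args=VLLMArgs(
        model=MODEL_ID,
        served_model_name=[MODEL_ID],
        trust_remote_code=True,
        max_model_len=8192,                # illustrative value
        gpu_memory_utilization=0.90,
        limit_mm_per_prompt={"image": 1},  # illustrative multi-modal limit
        quantization="awq",
        max_num_batched_tokens=8192,
    ),
)
```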
Equivalent vLLM Command Line (for reference):
The VLLM integration in Beam simplifies deployment. If you were to run a similar configuration using the vllm serve command-line tool directly, some of the corresponding arguments would be:
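A roughly equivalent invocation might look like the sketch below. The flag names follow vLLM’s CLI, but the values are illustrative and the exact syntax for options like --limit-mm-per-prompt can vary between vLLM versions.

```bash
vllm serve OpenGVLab/InternVL3-8B-AWQ \
  --served-model-name OpenGVLab/InternVL3-8B-AWQ \
  --trust-remote-code \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --limit-mm-per-prompt image=1 \
  --quantization awq \
  --max-num-batched-tokens 8192
```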
Deploying the API
To deploy our model, we can run the following command:
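For example, assuming the app definition above is saved as app.py and the VLLM instance is named internvl (both names are assumptions for this sketch), the Beam CLI command would look roughly like:

```bash
# Deploy the vLLM server defined in app.py (file and variable names assumed).
beam deploy app.py:internvl
```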
The deployment output will include the URL of your new endpoint, which you’ll use in the next section.
Using the API
Pre-requisites
Once your function is deployed, you can interact with it using the OpenAI Python client.
To get started, you can clone the example repository and run the chat.py script. Make sure you have the openai library installed locally, since that is how we interact with the deployed API.
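If you don’t already have it, the client library can be installed with pip:

```bash
pip install openai
```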
Starting a Dialogue
You will be greeted with a prompt to enter the URL of your deployed function.
Once you enter the URL, the container will initialize on Beam and you will be able to interact with the model.
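If you prefer to call the API directly rather than through chat.py, here is a minimal sketch using the OpenAI Python client. The base URL and token are placeholders (the /v1 suffix assumes the standard OpenAI-compatible route), the auth token is assumed to be your Beam token passed as the API key, and the model name must match served_model_name.

```python
# Minimal sketch of talking to the deployed server with the OpenAI client.
# URL and token are placeholders; adjust them to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-deployment-url>/v1",  # placeholder URL
    api_key="YOUR_BEAM_AUTH_TOKEN",               # placeholder token
)

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-8B-AWQ",
    messages=[{"role": "user", "content": "Describe what vLLM does in one sentence."}],
)
print(response.choices[0].message.content)
```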
To host other models, you can simply change the arguments you pass into the VLLM class.