Run an OpenAI-Compatible vLLM Server
In this example we are going to use vLLM to host an OpenAI compatible InternVL2.5 8B API on Beam.
View the Code
See the code for this example on Github.
Introduction to vLLM
vLLM is a high-performance, easy-to-use library for LLM inference. It can be up to 24 times faster than HuggingFace’s Transformers library and it allows you to easily setup an OpenAI compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA. This means that you can easily follow our LoRA guide and host your resulting model using vLLM.
The key to vLLM’s performance is Paged Attention. In LLMs, input tokens produce attention keys and value tensors, which are typically stored in GPU memory. Paged Attention stores these continuous keys and values in non-contiguous memory by partitioning them into blocks that are fetched on a need-to-use basis.
Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
Hosting an OpenAI-Compatible Chat API with vLLM
With vLLM, we can host a fully functional chat API that we can use with already built SDKs to interact with. You could build this functionality yourself, but vLLM provides a great out of the box solution as well.
Initial Setup
To get started with vLLM on Beam, we can use the VLLM
class from the Beam SDK. This class supports all of the flags and arguments of the vLLM command line tool as arguments.
Setup Compute Environment
Let’s take a look at the code required to deploy the InternVL2.5 8B model from OpenGVLab. Just like a normal Beam application, we start by defining the environment. For this model, we will use 8 CPU, 16GB of memory, and two A10G GPU. With those details set, we can focus on what arguments we need to pass to our vLLM server.
The first argument we need to set is the model. Then, for this model, we will set the trust_remote_code
to True
since it will be using tool calling functionality. Finally, we will set the max_model_len
to 4096, which is the maximum number of tokens that can be used in a single request and the limit_mm_per_prompt
to 2, which limits the number of images that can be used in a single request.
The equivalent vLLM command line tool command would be:
Deploying the API
To deploy our model, we can run the following command:
The output will look like this:
Using the API
Pre-requisites
Once your function is deployed, you can interact with it using the OpenAI Python client.
To get started, you can clone the example repository and run the chat.py
script.
Make sure you have the openai
library installed locally, since that is how we interact with the deployed API.
Starting a Dialogue
You will be greeted with a prompt to enter the URL of your deployed function.
Once you enter the URL, the container will initialize on Beam and you will be able to interact with the model.
To host other models, you can simply change the arguments you pass into the VLLM
class.
Was this page helpful?