Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
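As a toy illustration of that analogy (not vLLM's actual implementation), a per-sequence block table simply maps logical block indices to whatever physical blocks happen to be free; the block size and indices below are made up:

```python
# Toy illustration of the block-table analogy above; not vLLM's real code.
BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical size)

# Logical blocks of one sequence mapped to non-contiguous physical blocks.
block_table = {0: 7, 1: 3, 2: 12}

def physical_slot(token_index: int) -> tuple[int, int]:
    """Translate a token position into (physical block, offset within block)."""
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    return block_table[logical_block], offset

print(physical_slot(20))  # -> (3, 4): token 20 lives in physical block 3, slot 4
```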
The deployment is built around the VLLM class from the Beam SDK. This class supports all of the flags and arguments of the vLLM command-line tool as arguments.
We will serve the OpenGVLab/InternVL3-8B-AWQ model with an efficient configuration. We start by defining the environment and the necessary arguments for our vLLM server:
- name: A descriptive name for your Beam application.
- cpu: Number of CPU cores allocated (e.g., 4).
- memory: Amount of memory allocated (e.g., “16Gi”).
- gpu: Type of GPU to use (e.g., “A10G”).
- gpu_count: Number of GPUs (e.g., 1).
- workers: Number of worker processes for vLLM (e.g., 1).
- vllm_args: Arguments passed directly to the vLLM engine:
  - model: The Hugging Face model identifier.
  - served_model_name: Name under which the model is served.
  - trust_remote_code: Allows the model to execute custom code if required.
  - max_model_len: Maximum token sequence length for the model.
  - gpu_memory_utilization: Target GPU memory utilization (e.g., 0.90 for 90%).
  - limit_mm_per_prompt: (If applicable) Limits for multi-modal inputs.
  - quantization: Enables model quantization (e.g., “awq”). This is often beneficial even if the model name suggests it’s pre-quantized, as vLLM handles the specifics.
  - max_num_batched_tokens: Sets the capacity for tokens in a batch for dynamic batching (e.g., 8192).
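Putting these pieces together, the deployment definition might look roughly like the sketch below. It assumes the VLLM and VLLMArgs helpers are importable from beam.integrations, as in Beam's vLLM examples; the exact import path, parameter names, and values shown here are illustrative and may differ from the script used in this post or from your SDK version.

```python
# Sketch of a Beam deployment using the VLLM class (illustrative values;
# check the Beam docs for the exact import path and supported arguments).
from beam.integrations import VLLM, VLLMArgs

MODEL_ID = "OpenGVLab/InternVL3-8B-AWQ"

internvl_server = VLLM(
    name="internvl3-8b-awq",   # descriptive Beam app name (hypothetical)
    cpu=4,                     # CPU cores
    memory="16Gi",             # RAM
    gpu="A10G",                # GPU type
    gpu_count=1,               # number of GPUs
    workers=1,                 # vLLM worker processes
    vllm_args=VLLMArgs(
        model=MODEL_ID,
        served_model_name=[MODEL_ID],
        trust_remote_code=True,
        max_model_len=8192,                # illustrative
        gpu_memory_utilization=0.90,
        limit_mm_per_prompt={"image": 1},  # illustrative multi-modal limit
        quantization="awq",
        max_num_batched_tokens=8192,
    ),
)
```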
The VLLM integration in Beam simplifies deployment. If you were to run a similar configuration using the vllm serve command-line tool directly, some of the corresponding arguments would be:
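The command below is a rough equivalent, not the exact invocation from this post; the flag names come from vLLM's CLI, the values mirror the configuration above, and the --limit-mm-per-prompt syntax varies between vLLM versions.

```
vllm serve OpenGVLab/InternVL3-8B-AWQ \
  --served-model-name OpenGVLab/InternVL3-8B-AWQ \
  --trust-remote-code \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --limit-mm-per-prompt image=1 \
  --quantization awq \
  --max-num-batched-tokens 8192
```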
Once the server is deployed, you can test it with the chat.py script. Make sure you have the openai library installed locally, since that is how we interact with the API served by the deployed VLLM class.
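A minimal version of such a test client might look like the sketch below; the base URL, auth token, and served model name are placeholders, so substitute the values from your own deployment.

```python
# Minimal chat client sketch; the base URL, API key, and model name below are
# placeholders for the values of your own Beam deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-beam-app-url>/v1",  # deployment URL + /v1
    api_key="<YOUR_BEAM_AUTH_TOKEN>",
)

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-8B-AWQ",  # must match served_model_name
    messages=[{"role": "user", "content": "Describe vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```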