Qwen2.5-7B with SGLang
This guide demonstrates how to deploy a high-performance language model server using SGLang with the Qwen2.5-7B-Instruct model from Qwen. The server runs on Beam, providing an OpenAI-compatible API endpoint for text generation.
View the Code
See the full code for this example on GitHub.
Overview
SGLang is a fast inference framework for large language models, optimized for low latency and high throughput. We use it to serve the Qwen2.5-7B-Instruct model, a 7-billion-parameter instruction-tuned model, on an A100-40 GPU via Beam's Pod abstraction.
A test script demonstrates interaction with the server using the OpenAI Python client.
Setup
First, create a file named `app.py`.
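The full file is available in the linked repository; the following is a minimal sketch assuming Beam's Pod API (`Image`, `Pod`, `create()`) and SGLang's `launch_server` entrypoint. The GPU string, port, and package list here are illustrative and may differ from the original example:

```python
from beam import Image, Pod

# Container image with SGLang and its serving dependencies.
image = Image(python_version="python3.11").add_python_packages(["sglang[all]"])

# Pod running SGLang's OpenAI-compatible server on an A100-40 GPU.
sglang_server = Pod(
    image=image,
    gpu="A100-40",
    ports=[8080],
    entrypoint=[
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen2.5-7B-Instruct",
        "--host", "0.0.0.0",
        "--port", "8080",
    ],
)

# Create the pod and print its public URL.
result = sglang_server.create()
print("Pod URL:", result.url)
```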
Deployment
Deploy the server using the Beam CLI:
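The exact command isn't preserved here. Since the sketch above creates the Pod via `create()`, running the script should perform the deployment (an assumption; your version of the Beam CLI may offer an equivalent dedicated command):

```sh
python app.py
```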
The deployment output includes the URL of the deployed app, which you'll use as the base URL in the test script below.
API Usage
The SGLang server exposes an OpenAI-compatible API at `/v1`. You can interact with it using the OpenAI Python client or any HTTP client.
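For example, a raw HTTP request might look like the sketch below; the pod URL placeholder and Bearer-token auth are assumptions about your deployment:

```sh
curl https://<your-pod-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BEAM_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```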
Test Script
Create a file named `test.py` to test the deployed server.
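A minimal sketch using the OpenAI Python client; the API key and base URL below are placeholders you'll replace with your own values:

```python
from openai import OpenAI

# Point the OpenAI client at the deployed SGLang server.
client = OpenAI(
    api_key="BEAM_API_KEY",  # placeholder: your Beam API key
    base_url="https://<your-pod-url>/v1",  # placeholder: your pod's URL
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.7,
    max_tokens=128,
)

print(response.choices[0].message.content)
```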
Running the Test
- Replace `BEAM_API_KEY` with your actual Beam API key.
- Update the `base_url` with your deployed pod's URL.
- Install the OpenAI client locally and run the script (see the commands after this list).
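Both steps, assuming `test.py` is in your working directory:

```sh
pip install openai
python test.py
```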
Expected output for the prompt "List 3 countries and their capitals" (exact text will vary):
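Model output is nondeterministic, so the wording will differ run to run; a representative response:

```
1. France – Paris
2. Japan – Tokyo
3. Brazil – Brasília
```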