This guide demonstrates how to deploy a high-performance language model server using SGLang with Qwen's Qwen2.5-7B-Instruct model. The server runs on Beam and exposes an OpenAI-compatible API endpoint for text generation.
SGLang is a fast inference framework for large language models, optimized for low latency and high throughput. We use it to serve the Qwen2.5-7B-Instruct model, a 7-billion-parameter instruction-tuned model, on an A100-40 GPU via Beam’s Pod abstraction.
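For context, the deployment itself can be expressed with Beam's Python SDK. The following is a minimal sketch rather than a verified configuration: it assumes Beam's `Pod` and `Image` classes accept the parameters shown and that `create()` returns an object exposing the Pod's public URL; check the Beam documentation for the exact signatures in your SDK version. The SGLang launch command (`python -m sglang.launch_server`) is the framework's standard server entrypoint.

```python
from beam import Image, Pod

# Sketch of a Pod that launches an SGLang server on an A100-40.
# Image/Pod parameter names are assumptions; verify against the Beam docs.
sglang_server = Pod(
    image=Image(
        python_version="python3.11",
        python_packages=["sglang[all]"],  # assumed install; pin a version in practice
    ),
    gpu="A100-40",   # the GPU referenced in this guide
    ports=[8080],    # the port embedded in the endpoint URL used below
    entrypoint=[
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen2.5-7B-Instruct",
        "--host", "0.0.0.0",
        "--port", "8080",
    ],
)

result = sglang_server.create()  # assumed to return the Pod's public URL
print("Server URL:", result.url)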
Once the server is up, a short test script demonstrates how to interact with it using the OpenAI Python client, since the SGLang endpoint is OpenAI-compatible. Create a file named test.py:
```python
import openai

# Initialize the OpenAI client with the Beam endpoint and Beam API key
client = openai.Client(
    base_url="https://35b937b9-1a70-4343-89d9-1125b1290e4d-8080.app.beam.cloud/v1",
    api_key="BEAM_API_KEY",  # Replace with your actual Beam API key
)

# Send a chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the model's response
print(response.choices[0].message.content)
```
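Run the script with `python test.py`. Two parameter choices are worth noting: `temperature=0` makes the completion effectively deterministic, which is convenient when smoke-testing an endpoint, and `max_tokens=64` caps the response length so the request stays cheap and fast. If the request succeeds, the model should return a brief list of three countries with their capitals.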