This guide demonstrates how to deploy a high-performance language model server using SGLang with Qwen's Qwen2.5-7B-Instruct model. The server runs on Beam and provides an OpenAI-compatible API endpoint for text generation.

View the Code

See the full code for this example on GitHub.

Overview

SGLang is a fast inference framework for large language models, optimized for low latency and high throughput. We use it to serve Qwen2.5-7B-Instruct, a 7-billion-parameter instruction-tuned model, on an A100-40 GPU via Beam's Pod abstraction.

A test script demonstrates interaction with the server using the OpenAI Python client.

Setup

First, create a file named app.py:

from beam import Image, Pod

# Container image with SGLang and its dependencies
image = (
    Image(python_version="python3.11")
    .add_python_packages([
        "transformers==4.47.1",
        "numpy<2",
        "fastapi[standard]==0.115.4",
        "pydantic==2.9.2",
        "starlette==0.41.2",
        "torch==2.4.0",
    ])
    .add_commands([
        'pip install "sglang[all]==0.4.1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/'
    ])
)

# Define the SGLang server Beam Pod
sglang_server = Pod(
    image=image,
    ports=[8080],
    cpu=12,
    memory="32Gi",
    gpu="A100-40",
    secrets=["HF_TOKEN"],
    entrypoint=[
        "python",
        "-m",
        "sglang.launch_server",
        "--model-path",
        "Qwen/Qwen2.5-7B-Instruct",
        "--port",
        "8080",
        "--host",
        "0.0.0.0",
    ],
)

# Deploy the pod
res = sglang_server.create()

print("✨ SGLang server hosted at:", res.url)

Deployment

Deploy the server by running the script:

python app.py

Here’s the expected output, with the URL of the deployed app:

=> Files synced
=> Creating container
=> Container created successfully ===> pod-b451fa2f-3c4a-47e0-bb37-333434fds22b66-add2d058
=> This container will timeout after 600 seconds.
=> Invocation details
curl -X POST 'https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-d '{}'
✨ SGLang server hosted at: https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud
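
Before sending requests, you can verify the server is ready. Here is a minimal sketch of a health check, assuming SGLang's /health route and using the example URL from the output above; substitute your own pod URL:

import requests

# Example URL from the deployment output above; replace with your pod's URL
BASE_URL = "https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud"

# SGLang exposes a /health route; a 200 response means the server is up
resp = requests.get(f"{BASE_URL}/health")
print(resp.status_code)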

API Usage

The SGLang server exposes an OpenAI-compatible API at /v1. You can interact with it using the OpenAI Python client or any HTTP client.
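
For example, here is a minimal sketch of a raw HTTP request using the requests library, assuming the standard OpenAI-compatible /v1/chat/completions route; the URL and API key below are placeholders to replace with your own values:

import requests

url = "https://<your-pod-id>-8080.app.beam.cloud/v1/chat/completions"
headers = {
    "Authorization": "Bearer BEAM_API_KEY",  # placeholder; use your Beam API key
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.json()["choices"][0]["message"]["content"])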

Test Script

Create a file named test.py to test the deployed server:

import openai

# Initialize OpenAI client with Beam endpoint and Beam API key
client = openai.Client(
    base_url="https://35b937b9-1a70-4343-89d9-1125b1290e4d-8080.app.beam.cloud/v1",
    api_key="BEAM_API_KEY",  # Replace with your actual Beam API key
)

# Send a chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the response
print(response.choices[0].message.content)

Running the Test

  1. Replace BEAM_API_KEY with your actual Beam API key.
  2. Update the base_url with your deployed pod's URL.
  3. Install the OpenAI client locally:
pip install openai
  4. Run the script:
python test.py

Here's the expected output for the prompt “List 3 countries and their capitals.”:

1. France - Paris
2. Japan - Tokyo
3. Brazil - Brasília
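
Streaming

The OpenAI client can also stream tokens as they are generated, which is useful for longer responses. Below is a minimal streaming variant of test.py, using the same placeholder base_url and API key as above:

import openai

client = openai.Client(
    base_url="https://<your-pod-id>-8080.app.beam.cloud/v1",
    api_key="BEAM_API_KEY",  # Replace with your actual Beam API key
)

# stream=True yields chunks as tokens are generated
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Print each token delta as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()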