# Qwen2.5-7B with SGLang

This guide shows how to deploy a high-performance language model server using [SGLang](https://github.com/sgl-project/sglang) with the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model. The server runs on Beam and provides an OpenAI-compatible API endpoint for text generation.

<Card title="View the Code" icon="github" href="https://github.com/beam-cloud/examples/tree/main/sglang/">
  See the full code for this example on GitHub.
</Card>

## Overview

SGLang is a fast inference framework for large language models, optimized for low latency and high throughput. We use it to serve the Qwen2.5-7B-Instruct model, a 7-billion-parameter instruction-tuned model, on an H100 GPU via Beam’s `Pod` abstraction.

A test script demonstrates interaction with the server using the OpenAI Python client.

## Setup

First, create a file named `app.py`:

```python theme={null}
from beam import Image, Pod

# Container image with SGLang and its dependencies
image = (
    Image(python_version="python3.11")
    .add_python_packages([
        "transformers==4.47.1",
        "numpy<2",
        "fastapi[standard]==0.115.4",
        "pydantic==2.9.2",
        "starlette==0.41.2",
        "torch==2.4.0",
    ])
    .add_commands([
        'pip install "sglang[all]==0.4.1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/'
    ])
)

# Define the SGLang server Beam Pod
sglang_server = Pod(
    image=image,
    ports=[8080],
    cpu=12,
    memory="32Gi",
    gpu="H100",
    secrets=["HF_TOKEN"],
    entrypoint=[
        "python",
        "-m",
        "sglang.launch_server",
        "--model-path",
        "Qwen/Qwen2.5-7B-Instruct",
        "--port",
        "8080",
        "--host",
        "0.0.0.0",
    ],
)

# Deploy the pod
res = sglang_server.create()

print("✨ SGLang server hosted at:", res.url)
```
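The `entrypoint` in `app.py` is an argv-style list, so each flag and its value are separate items. As a quick sanity check, the sketch below renders the same list as the single shell command it corresponds to. It uses only the standard library and does not touch Beam. Note that the `--port` value is the same port listed in `ports=[8080]`.

```python
import shlex

# Same argv list as the `entrypoint` argument in app.py.
entrypoint = [
    "python",
    "-m",
    "sglang.launch_server",
    "--model-path",
    "Qwen/Qwen2.5-7B-Instruct",
    "--port",
    "8080",
    "--host",
    "0.0.0.0",
]

# shlex.join renders the list as one shell-safe command string.
command = shlex.join(entrypoint)
print(command)
```

Printing the command this way makes it easier to spot a missing value or a flag that was accidentally merged with its argument.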

## Deployment

Deploy the server by running the script with Python:

```bash theme={null}
python app.py
```

Here's the expected output, with the URL of the deployed app:

```bash theme={null}
=> Files synced
=> Creating container
=> Container created successfully ===> pod-b451fa2f-3c4a-47e0-bb37-333434fds22b66-add2d058
=> This container will timeout after 600 seconds.
=> Invocation details
curl -X POST 'https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-d '{}'
✨ SGLang server hosted at: https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud
```
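The URL is printed as soon as the container exists, but the server may still be loading model weights at that point. The following is a minimal sketch of a readiness poll, not part of the original example. It assumes the server answers `GET /v1/models` (part of the OpenAI-compatible API surface) and that the Beam API key is sent as a bearer token, as the OpenAI client does. The function names `models_endpoint_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, api_key: str) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    Assumption: the deployed server exposes /v1/models and accepts the
    Beam API key as a bearer token.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage (replace the URL and key with your own values):
# ready = wait_until_ready(
#     lambda: models_endpoint_ok("https://<your-pod-url>", "BEAM_API_KEY")
# )
```

Passing the check as a callable keeps the polling loop independent of the HTTP details, so a different probe can be swapped in if your server exposes another health route.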

## API Usage

The SGLang server exposes an OpenAI-compatible API at `/v1`. You can interact with it using the OpenAI Python client or any HTTP client.
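For the "any HTTP client" case, here is a minimal sketch using only Python's standard library. It assumes the standard OpenAI route `/v1/chat/completions` and that the Beam API key is passed as a bearer token, which is how the OpenAI client transmits its `api_key`. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 64,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the Beam API key is accepted as a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout_s: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example usage (replace the URL and key with your own values):
# request = build_chat_request(
#     "https://<your-pod-url>", "BEAM_API_KEY", "List 3 countries and their capitals."
# )
# print(send_chat_request(request))
```

Here `base_url` is the pod URL without the `/v1` suffix; the helper appends the full route itself.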

### Test Script

Create a file named `test.py` to test the deployed server:

```python theme={null}
import openai

# Initialize OpenAI client with Beam endpoint and Beam API key
client = openai.Client(
    base_url="https://35b937b9-1a70-4343-89d9-1125b1290e4d-8080.app.beam.cloud/v1",
    api_key="BEAM_API_KEY",  # Replace with your actual Beam API key
)

# Send a chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the response
print(response.choices[0].message.content)
```

#### Running the Test

1. Replace `BEAM_API_KEY` with your actual Beam API key.
2. Update the `base_url` with your deployed pod’s URL.
3. Install the OpenAI client locally:

```bash theme={null}
pip install openai
```

4. Run the script:

```bash theme={null}
python test.py
```

Expected output for the prompt *"List 3 countries and their capitals."*:

```
1. France - Paris
2. Japan - Tokyo
3. Brazil - Brasília
```
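Steps 1 and 2 under "Running the Test" edit the key and URL directly in `test.py`. One alternative, sketched here under assumptions, is to read both from environment variables so the key never lands in source control. The variable name `SGLANG_POD_URL` and the helper `load_settings` are hypothetical; `BEAM_API_KEY` reuses the placeholder name from the test script.

```python
import os
from typing import Mapping, Optional, Tuple


def load_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (base_url, api_key) read from environment variables.

    SGLANG_POD_URL is a hypothetical variable holding the deployed pod URL;
    the "/v1" suffix is appended if it is missing.
    """
    env = os.environ if env is None else env
    api_key = env.get("BEAM_API_KEY", "")
    pod_url = env.get("SGLANG_POD_URL", "")
    if not api_key or not pod_url:
        raise RuntimeError("Set BEAM_API_KEY and SGLANG_POD_URL before running the test.")
    base_url = pod_url.rstrip("/")
    if not base_url.endswith("/v1"):
        base_url += "/v1"
    return base_url, api_key
```

In `test.py`, the returned values could then be passed as `openai.Client(base_url=base_url, api_key=api_key)` in place of the hardcoded strings.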
