# Qwen2.5-7B with SGLang

This guide demonstrates how to deploy a high-performance language model server using [SGLang](https://github.com/sgl-project/sglang) with the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model from Qwen. The server runs on Beam and provides an OpenAI-compatible API endpoint for text generation.

See the full code for this example on GitHub.

## Overview

SGLang is a fast inference framework for large language models, optimized for low latency and high throughput. We use it to serve Qwen2.5-7B-Instruct, a 7-billion-parameter instruction-tuned model, on an H100 GPU via Beam’s `Pod` abstraction. A test script then queries the server with the OpenAI Python client.

## Setup

First, create a file named `app.py`:

```python
from beam import Image, Pod

# Image with SGLang and its dependencies
image = (
    Image(python_version="python3.11")
    .add_python_packages([
        "transformers==4.47.1",
        "numpy<2",
        "fastapi[standard]==0.115.4",
        "pydantic==2.9.2",
        "starlette==0.41.2",
        "torch==2.4.0",
    ])
    .add_commands([
        'pip install "sglang[all]==0.4.1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/'
    ])
)

# Define the SGLang server as a Beam Pod
sglang_server = Pod(
    image=image,
    ports=[8080],
    cpu=12,
    memory="32Gi",
    gpu="H100",
    secrets=["HF_TOKEN"],
    entrypoint=[
        "python",
        "-m",
        "sglang.launch_server",
        "--model-path",
        "Qwen/Qwen2.5-7B-Instruct",
        "--port",
        "8080",
        "--host",
        "0.0.0.0",
    ],
)

# Deploy the pod and print its public URL
res = sglang_server.create()
print("SGLang server hosted at:", res.url)
```

## Deployment

Deploy the server by running the script:

```bash
python app.py
```

Here's the expected output, with the URL of the deployed app:

```bash
=> Files synced
=> Creating container
=> Container created successfully ===> pod-b451fa2f-3c4a-47e0-bb37-333434fds22b66-add2d058
=> This container will timeout after 600 seconds.
=> Invocation details
curl -X POST 'https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-d '{}'
SGLang server hosted at: https://b451fa2f-3c4a-47e0-bb37-333434fds22b66-8080.app.beam.cloud
```

## API Usage

The SGLang server exposes an OpenAI-compatible API at `/v1`. You can interact with it using the OpenAI Python client or any HTTP client.

### Test Script

Create a file named `test.py` to test the deployed server:

```python
import openai

# Initialize OpenAI client with Beam endpoint and Beam API key
client = openai.Client(
    base_url="https://35b937b9-1a70-4343-89d9-1125b1290e4d-8080.app.beam.cloud/v1",
    api_key="BEAM_API_KEY",  # Replace with your actual Beam API key
)

# Send a chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the response
print(response.choices[0].message.content)
```

#### Running the Test

1. Replace `BEAM_API_KEY` with your actual Beam API key.
2. Update the `base_url` with your deployed pod’s URL, keeping the `/v1` suffix.
3. Install the OpenAI client locally:

   ```bash
   pip install openai
   ```

4. Run the script:

   ```bash
   python test.py
   ```

Expected output for the prompt *"List 3 countries and their capitals."*:

```
1. France - Paris
2. Japan - Tokyo
3. Brazil - Brasília
```
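
The API Usage section notes that any HTTP client can talk to the server, not only the OpenAI client. The following is a minimal sketch of the same request using only the Python standard library. It assumes the server exposes the standard OpenAI-style `/chat/completions` route under `/v1` and accepts the Beam API key as a bearer token, which is how the OpenAI client sends its `api_key`. The helper names `build_chat_request` and `send_chat_request` are hypothetical and introduced here purely for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route.

    Assumption: base_url is the pod URL with the /v1 suffix, as in test.py.
    """
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the key is passed as a bearer token,
            # matching what the OpenAI client does with api_key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

To try it, pass your pod URL (with `/v1`) and Beam API key to `build_chat_request`, then hand the result to `send_chat_request` and print the returned string. Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.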