Llama 2 Inference
It’s easy to run Llama 2 on Beam. This example runs the 7B parameter model on a 24Gi A10G
GPU, and caches the model weights in a Storage Volume.
Prerequisites
- Request access to the model on Hugging Face
- Add your Hugging Face API token to the Beam Secrets Manager as HUGGINGFACE_API_KEY (a sketch of how it's read at runtime follows below)
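Secrets stored in the Beam Secrets Manager are exposed to your app as environment variables. As a rough sketch of how the token and the Storage Volume fit together (the volume path and exact transformers calls here are illustrative assumptions, not taken from the repo), loading the gated model might look like this:

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Beam exposes the secret added above as an environment variable
auth_token = os.environ["HUGGINGFACE_API_KEY"]

# "./model_weights" stands in for the Storage Volume mount path
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_auth_token=auth_token,
    cache_dir="./model_weights",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_auth_token=auth_token,
    cache_dir="./model_weights",
)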
Clone this example
Beam provides a repo of examples, and you can clone this example app by running this command:
beam create-app llama2
cd into the new llama2 directory.
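The app itself is defined in app.py. As a minimal sketch of the configuration described above, an A10G GPU plus a Storage Volume for the weights, using the Beam Python SDK; treat the exact fields as assumptions and check the app.py you just cloned for the authoritative version:

from beam import App, Runtime, Image, Volume

app = App(
    name="llama2",
    runtime=Runtime(
        cpu=4,
        memory="32Gi",
        gpu="A10G",  # 24Gi GPU, as noted above
        image=Image(python_packages=["torch", "transformers", "accelerate"]),
    ),
    # Storage Volume where the model weights are cached between runs
    volumes=[Volume(name="model_weights", path="./model_weights")],
)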
Run inference
You can run a single inference on Beam with the beam run command:
beam run app.py:generate -d '{"prompt": "Summarize rail travel in the United States"}'
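The JSON payload passed with -d arrives in generate as keyword arguments. A rough sketch of such a handler, assuming the model and tokenizer are loaded as in the earlier snippet (the decorator shown is one plausible wiring, not necessarily what the repo uses):

@app.rest_api()
def generate(**inputs):
    prompt = inputs["prompt"]
    # Tokenize the prompt and run generation on the GPU
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=256)
    return {"output": tokenizer.decode(output_ids[0], skip_special_tokens=True)}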
Deploy the API
Let’s deploy this as a web API so we can make requests from the outside world.
beam deploy app.py:generate
After deploying, you’ll be able to copy a cURL request to hit the API.
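For illustration, calling the deployed endpoint from Python would look something like this; the URL and auth token are placeholders, so copy the real values from the cURL snippet in your dashboard:

import requests

# Placeholder values: copy the real URL and token from the Beam dashboard
url = "https://apps.beam.cloud/<APP_ID>"
headers = {
    "Authorization": "Basic <YOUR_BEAM_AUTH_TOKEN>",
    "Content-Type": "application/json",
}
payload = {"prompt": "Summarize rail travel in the United States"}

response = requests.post(url, headers=headers, json=payload)
print(response.json())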
