It’s easy to run Llama 2 on Beam. This example runs the 7B-parameter model on an A10G GPU with 24Gi of VRAM, and caches the model weights in a Storage Volume so they only have to be downloaded once.
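
For context, this is roughly how such an app is defined with Beam's Python SDK. The resource values and package list below are assumptions for illustration; the app.py in the example you'll clone below is the source of truth:

from beam import App, Image, Runtime, Volume

app = App(
    name="llama2",
    runtime=Runtime(
        gpu="A10G",  # 24Gi of VRAM, enough for the 7B model in fp16
        cpu=4,
        memory="32Gi",
        image=Image(
            python_packages=["torch", "transformers", "accelerate", "sentencepiece"],
        ),
    ),
    # The Storage Volume persists between runs, so the weights are only
    # downloaded from Hugging Face the first time the app runs
    volumes=[Volume(name="model_weights", path="./model_weights")],
)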

Prerequisites

  1. Request access to the model on Hugging Face
  2. Add your Hugging Face API token to the Beam Secrets Manager as HUGGINGFACE_API_KEY
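
The secret is exposed to your app as an environment variable at runtime, where it can be read and passed to transformers to authorize the gated download. A minimal sketch, assuming the meta-llama/Llama-2-7b-hf checkpoint:

import os

from transformers import AutoTokenizer

# Beam exposes secrets to the app as environment variables
auth_token = os.environ["HUGGINGFACE_API_KEY"]

# The token authorizes the download of the gated Llama 2 weights
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_auth_token=auth_token,
)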

Clone this example

Beam provides a repo of examples, and you can clone this example app by running this command:

beam create-app llama2 

cd into the new llama2 directory.

Run inference

You can run a single inference on Beam with the beam run command:

beam run app.py:generate -d '{"prompt": "Summarize rail travel in the United States"}'
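
The -d flag passes a JSON payload whose keys are handed to the function as keyword arguments. As an illustration (not the exact code in the example), a generate function compatible with that payload could look like this:

import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(**inputs):
    # Each key in the JSON payload arrives as a keyword argument
    prompt = inputs["prompt"]

    model_id = "meta-llama/Llama-2-7b-hf"
    auth_token = os.environ["HUGGINGFACE_API_KEY"]

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, use_auth_token=auth_token, cache_dir="./model_weights"
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        use_auth_token=auth_token,
        cache_dir="./model_weights",  # cache into the Storage Volume
        torch_dtype=torch.float16,
        device_map="auto",
    )

    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**encoded, max_new_tokens=256)
    return {"output": tokenizer.decode(output[0], skip_special_tokens=True)}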

Deploy the API

Let’s deploy this as a web API, so we can make requests from the outside world.

beam deploy app.py:generate

After deploying, you’ll be able to copy a cURL request to hit the API.
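
The exact URL and auth header come from your deployment, but with placeholders the request has roughly this shape:

curl -X POST '<YOUR_DEPLOYMENT_URL>' \
  -H 'Authorization: <YOUR_AUTH_HEADER>' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Summarize rail travel in the United States"}'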