Llama 3.1 8B
This guide demonstrates how to run the Meta Llama 3.1 8B Instruct model on Beam. Note that this is a gated Hugging Face model, and you must request access to it.
View the Code
See the code for this example on GitHub.
Introduction
Meta Llama 3.1 8B Instruct is a powerful language model gated on Hugging Face. Follow these steps to set up and deploy the model on Beam.
Prerequisites
- Request Access: Request access to the gated model on its Hugging Face model page.
- Retrieve HF Token: Generate an access token in your Hugging Face account settings (Settings → Access Tokens).
- Save HF Token on Beam: Run `beam secret create HF_TOKEN [TOKEN]` to save your token as a secret.
Setup Remote Environment
The first thing we’ll do is set up an `Image` with the Python packages required for this app. We use an `env.is_remote()` check to conditionally import the Python packages only when the script is running remotely on Beam.
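A minimal sketch of that setup (the Python version and package list are illustrative, not prescriptive):

```python
from beam import Image, env

# Only import heavy ML packages when the code is actually running
# remotely on Beam -- locally, the CLI just needs the app definition.
if env.is_remote():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

# Container image with the Python packages required for inference
image = Image(
    python_version="python3.11",
    python_packages=[
        "torch",
        "transformers",
        "accelerate",
    ],
)
```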
Inference Function
Here’s the inference function. By adding the `@endpoint()` decorator to it, we can expose this function as a RESTful API. Note the `secrets` argument, which ensures the Hugging Face token is loaded into the environment.
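A sketch of what this can look like, continuing the file above. The model ID, endpoint name, GPU type, and resource sizes are assumptions; for brevity the model is loaded inside the handler, whereas a real app would typically load it once per container rather than per request:

```python
import os

from beam import endpoint

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed model repo

@endpoint(
    name="llama-3-1-8b-instruct",  # illustrative name
    image=image,
    cpu=4,
    memory="16Gi",
    gpu="A10G",                    # illustrative GPU type
    secrets=["HF_TOKEN"],          # injects the Hugging Face token as an env var
)
def generate(**inputs):
    prompt = inputs.get("prompt", "Hello!")

    # HF_TOKEN is available here because of the `secrets` argument above
    token = os.environ["HF_TOKEN"]

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=token)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        token=token,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Tokenize the prompt, generate a completion, and decode it back to text
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256)
    return {"output": tokenizer.decode(output[0], skip_special_tokens=True)}
```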
Serving the API
In your shell, serve the API with the `beam serve` command.
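Assuming the app lives in `app.py` and the endpoint function is named `generate` (both names are illustrative, matching the sketches above):

```sh
beam serve app.py:generate
```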
This command will:
- Spin up a container.
- Run it with the specified CPU, memory, and GPU resources.
- Sync your local files to the remote container.
- Print a cURL request to invoke the API.
- Stream logs to your shell.
Invoking the API
Once the API is running, you can invoke it with a cURL request.
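The exact request is printed to your shell when you serve the app; the sketch below shows its general shape, with the URL form being an assumption (prefer the one printed by `beam serve`):

```sh
curl -X POST 'https://app.beam.cloud/endpoint/[ENDPOINT-ID]' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer [AUTH-TOKEN]' \
  -d '{"prompt": "What is the capital of France?"}'
```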
Replace [ENDPOINT-ID] with your actual endpoint ID and [AUTH-TOKEN] with your authentication token.
Deploy to Production
The `beam serve` command is used for temporary APIs. When you’re ready to move to production, deploy a persistent endpoint with `beam deploy`.
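Using the same assumed file and function names as above:

```sh
beam deploy app.py:generate
```

The deployed endpoint persists and can be invoked with the same cURL pattern shown earlier.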
Summary
You’ve successfully set up a highly performant serverless API for generating text using the Meta Llama 3.1 8B Instruct model on Beam.