Llama 3 8B
This guide demonstrates how to run the Meta Llama 3 8B Instruct model on Beam. Note that this is a gated Hugging Face model, and you must request access to it before running this example.
View the Code
See the code for this example on GitHub.
Introduction
Meta Llama 3 8B Instruct is a gated model hosted on Hugging Face, so you need approved access before you can download the weights. Follow these steps to set up and deploy the model on Beam.
Prerequisites
- Request Access: Request access to the model here.
- Retrieve HF Token: Get your Hugging Face access token from this page.
- Save HF Token on Beam: Run beam secret create HF_TOKEN [TOKEN] to save your token as a Beam secret.
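Beam surfaces secrets listed in an endpoint's secrets argument as environment variables inside the container, where the Hugging Face libraries can pick up HF_TOKEN when downloading the gated weights. As a quick, optional sanity check (a minimal sketch, assuming the secret was created as above), you could verify the token from inside your app:

import os

# HF_TOKEN is injected as an environment variable because it is listed in the
# endpoint's secrets=["HF_TOKEN"] argument later in this guide
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise RuntimeError("HF_TOKEN is not set - create it with: beam secret create HF_TOKEN [TOKEN]")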
Setup Remote Environment
The first thing we’ll do is set up an Image with the Python packages required for this app. We use the env.is_remote() check to conditionally import the Python packages only when the script is running remotely on Beam.
from beam import endpoint, Image, Volume, env

# This ensures that these packages are only loaded when the script is running remotely on Beam
if env.is_remote():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

# Model parameters
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
MAX_LENGTH = 512
TEMPERATURE = 1.0
TOP_P = 0.95
TOP_K = 40
REPETITION_PENALTY = 1.0
NO_REPEAT_NGRAM_SIZE = 0
DO_SAMPLE = True
BEAM_VOLUME_PATH = "./cached_models"
# This runs once when the container first starts.
# Weights are cached in the mounted Beam volume so they are not re-downloaded on restart.
def load_models():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=BEAM_VOLUME_PATH)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        cache_dir=BEAM_VOLUME_PATH,
    )
    return model, tokenizer
Inference Function
Here’s the inference function. By adding the @endpoint() decorator, we expose this function as a RESTful API. Note the secrets argument, which makes the Hugging Face token available in the container’s environment.
@endpoint(
    secrets=["HF_TOKEN"],
    on_start=load_models,
    name="meta-llama-3-8b-instruct",
    cpu=2,
    memory="32Gi",
    gpu="A100-40",
    image=Image(
        python_version="python3.9",
        python_packages=["torch", "transformers", "accelerate"],
    ),
    volumes=[
        Volume(
            name="cached_models",
            mount_path=BEAM_VOLUME_PATH,
        )
    ],
)
def generate_text(context, **inputs):
    # Retrieve model and tokenizer from on_start
    model, tokenizer = context.on_start_value

    # Inputs passed to API
    messages = inputs.pop("messages", None)
    if not messages:
        return {"error": "Please provide messages for text generation."}

    generate_args = {
        "max_length": inputs.get("max_tokens", MAX_LENGTH),
        "temperature": inputs.get("temperature", TEMPERATURE),
        "top_p": inputs.get("top_p", TOP_P),
        "top_k": inputs.get("top_k", TOP_K),
        "repetition_penalty": inputs.get("repetition_penalty", REPETITION_PENALTY),
        "no_repeat_ngram_size": inputs.get("no_repeat_ngram_size", NO_REPEAT_NGRAM_SIZE),
        "do_sample": inputs.get("do_sample", DO_SAMPLE),
        "use_cache": True,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.pad_token_id,
    }

    # Format the chat messages with the model's chat template
    model_inputs = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    tokenized_inputs = tokenizer(model_inputs, return_tensors="pt", padding=True)
    input_ids = tokenized_inputs["input_ids"].to("cuda")
    attention_mask = tokenized_inputs["attention_mask"].to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids, attention_mask=attention_mask, **generate_args
        )

    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"output": output_text}
Serving the API
In your shell, serve the API by running:
beam serve app.py:generate_text
This command will:
- Spin up a container.
- Run it with the specified CPU, memory, and GPU resources.
- Sync your local files to the remote container.
- Print a cURL request to invoke the API.
- Stream logs to your shell.
Invoking the API
Once the API is running, you can invoke it using the following cURL command:
curl -X POST 'https://app.beam.cloud/endpoint/id/[ENDPOINT-ID]' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer [AUTH-TOKEN]' \
-d '{
"messages": [
{"role": "system", "content": "You are a yoda chatbot who always responds in yoda speak!"},
{"role": "user", "content": "Who are you?"}
]
}'
Replace [ENDPOINT-ID] with your actual endpoint ID and [AUTH-TOKEN] with your authentication token.
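If you prefer to call the endpoint from Python instead of cURL, here is a minimal client sketch using the requests library. It uses the same [ENDPOINT-ID] and [AUTH-TOKEN] placeholders as the cURL example above:

import requests

ENDPOINT_URL = "https://app.beam.cloud/endpoint/id/[ENDPOINT-ID]"  # replace [ENDPOINT-ID]
AUTH_TOKEN = "[AUTH-TOKEN]"  # replace with your Beam auth token

payload = {
    "messages": [
        {"role": "system", "content": "You are a yoda chatbot who always responds in yoda speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}", "Content-Type": "application/json"},
    json=payload,
    timeout=600,  # the first call may take a while if the container is cold-starting
)
response.raise_for_status()
print(response.json()["output"])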
Deploy to Production
The beam serve command creates a temporary endpoint for development and testing. When you’re ready to move to production, deploy a persistent endpoint:
beam deploy app.py:generate_text
Summary
You’ve successfully set up a serverless API for generating text with the Meta Llama 3 8B Instruct model on Beam.