Llama 2 Inference
It’s easy to run Llama 2 on Beam. This example runs the 7B-parameter model on a 24Gi A10G GPU and caches the model weights in a Storage Volume.
Prerequisites
- Request access to the model on Hugging Face
- Add your Hugging Face API token to the Beam Secrets Manager as HUGGINGFACE_API_KEY (a quick sanity check for the token is sketched below)
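To confirm the secret is wired up correctly, you can check that it is exposed to your code as an environment variable. This is a minimal sketch, assuming the secret was saved under the exact name HUGGINGFACE_API_KEY; it is not part of the example app itself.
import os

# Sanity check: the secret stored in the Beam Secrets Manager should be exposed
# to the container as the HUGGINGFACE_API_KEY environment variable.
token = os.environ.get("HUGGINGFACE_API_KEY")
if not token:
    raise RuntimeError("HUGGINGFACE_API_KEY is not set; add it in the Beam Secrets Manager")
print(f"Hugging Face token found ({len(token)} characters)")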
Clone this example
Beam provides a repo of examples, and you can clone this example app by running this command:
beam create-app llama2
Then cd into the new llama2 directory.
from beam import App, Runtime, Image, Output, Volume, VolumeType
import os
import torch
from transformers import (
    GenerationConfig,
    LlamaForCausalLM,
    LlamaTokenizer,
)

base_model = "meta-llama/Llama-2-7b-hf"

app = App(
    name="llama2",
    runtime=Runtime(
        cpu=1,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_packages=[
                "accelerate",
                "transformers",
                "torch",
                "sentencepiece",
            ],
        ),
    ),
    # Persistent volume used to cache the model weights between runs
    volumes=[
        Volume(
            name="model_weights",
            path="./model_weights",
            volume_type=VolumeType.Persistent,
        )
    ],
)


@app.task_queue(outputs=[Output(path="output.txt")])
def generate(**inputs):
    prompt = inputs["prompt"]

    tokenizer = LlamaTokenizer.from_pretrained(
        base_model,
        cache_dir="./model_weights",
        use_auth_token=os.environ["HUGGINGFACE_API_KEY"],
    )

    model = LlamaForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float16,
        device_map="auto",
        cache_dir="./model_weights",
        use_auth_token=os.environ["HUGGINGFACE_API_KEY"],
    )

    tokenizer.bos_token_id = 1

    # Tokenize the prompt and move it onto the GPU
    model_inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = model_inputs["input_ids"].to("cuda")

    generation_config = GenerationConfig(
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        max_length=512,
    )

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=128,
            early_stopping=True,
        )

    s = generation_output.sequences[0]
    decoded_output = tokenizer.decode(s, skip_special_tokens=True).strip()
    print(decoded_output)

    # Write the generated text to a file, which we'll retrieve when the async task completes
    output_path = "output.txt"
    with open(output_path, "w") as f:
        f.write(decoded_output)
Run inference
You can run a single inference on Beam with the beam run command:
beam run app.py:generate -d '{"prompt": "Summarize rail travel in the United States"}'
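The JSON payload passed with -d is forwarded to generate as keyword arguments, which is why the function reads inputs["prompt"]. Purely as an illustration of that mapping (running it directly requires a CUDA GPU and access to the model weights), the equivalent in-process call would be:
# Illustration only: the -d JSON payload becomes keyword arguments of generate()
from app import generate

generate(prompt="Summarize rail travel in the United States")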
Deploy the API
Let’s deploy this as a web API, so we can make requests from the outside world.
beam deploy app.py:generate
After deploying, you’ll be able to copy a cURL request to hit the API.
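If you prefer Python over cURL, the same request can be made with the requests library. This is a rough sketch only: the URL and auth token below are placeholders, and the exact values should be copied from your deployment in the Beam dashboard.
import requests

# Placeholder values: copy the real endpoint URL and auth token from the Beam dashboard
url = "https://<YOUR_DEPLOYMENT_URL>"
headers = {
    "Authorization": "Basic <YOUR_AUTH_TOKEN>",
    "Content-Type": "application/json",
}
payload = {"prompt": "Summarize rail travel in the United States"}

# For an async task queue deployment, the response typically acknowledges the
# queued task (e.g. with a task ID) rather than returning the generated text.
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())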