# Llama 3.1 8B

This guide demonstrates how to run the Meta Llama 3.1 8B Instruct model on Beam.

<Warning>
  You need a Hugging Face access token to run this example. Sign up for
  Hugging Face, create a token on [the settings
  page](https://huggingface.co/settings/tokens), and store it in the [Beam
  Secrets Manager](/v2/environment/secrets).
</Warning>

<Card title="View the Code" icon="github" href="https://github.com/beam-cloud/examples/tree/main/language_models/llama3_8b">
  See the code for this example on GitHub.
</Card>

## Prerequisites

1. **Request Access**: Request access to the model on its [Hugging Face model page](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
2. **Retrieve HF Token**: Get your Hugging Face token from [the settings page](https://huggingface.co/settings/tokens).
3. **Save HF Token on Beam**: Run `beam secret create HF_TOKEN [TOKEN]` to save your token.
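
Once the secret is saved and listed in the endpoint's `secrets` argument, the token should be available inside the container as an environment variable. The sketch below assumes it is exposed under the same name, `HF_TOKEN`. The `get_hf_token` helper is hypothetical and not part of the Beam SDK; it only shows one way to fail early with a clear message if the secret is missing.

```python
import os
from typing import Mapping, Optional


def get_hf_token(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token, or raise if it is not set.

    Hypothetical helper: assumes the Beam secret is exposed as the
    HF_TOKEN environment variable inside the container.
    """
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create it with "
            "`beam secret create HF_TOKEN [TOKEN]`."
        )
    return token
```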

## Set Up the Remote Environment

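First, we import Beam, define the model parameters, and write a `load_models` function that loads the tokenizer and model once when the container starts. The `Image` with the Python packages required for this app is defined on the endpoint in the next section.

We check `env.is_remote()` so that the heavy Python packages are imported only when the script is running remotely on Beam.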

```python
from beam import endpoint, Image, Volume, env

# This ensures that these packages are only loaded when the script is running remotely on Beam
if env.is_remote():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

# Model parameters
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
MAX_LENGTH = 512
TEMPERATURE = 0.7
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.05
NO_REPEAT_NGRAM_SIZE = 2
DO_SAMPLE = True
NUM_BEAMS = 1
EARLY_STOPPING = True

BEAM_VOLUME_PATH = "./cached_models"

# This runs once when the container first starts
def load_models():
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME,
        cache_dir=BEAM_VOLUME_PATH,
        padding_side='left'
    )
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        cache_dir=BEAM_VOLUME_PATH,
        use_cache=True,
        low_cpu_mem_usage=True
    )
    model.eval()
    return model, tokenizer
```
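
Loading the weights in `torch.float16` roughly halves their memory footprint compared to 32-bit floats. The back-of-the-envelope estimate below assumes about 8 billion parameters (taken from the model name) and counts weights only; activations and the KV cache need additional memory. The `weight_memory_gib` helper is illustrative and not part of any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Estimate the memory needed for model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # float16: 2 bytes per parameter
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"float32 weights: {fp32_gib:.1f} GiB")
```

On a GPU with 24 GB of memory, such as the A10G used in this guide, the half-precision weights (roughly 14.9 GiB) leave some headroom for inference, while full-precision weights would not fit.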

## Inference Function

Here's the inference function. Adding the `@endpoint` decorator exposes this function as a REST API.

Note the `secrets` argument, which loads the Hugging Face token into the environment.

```python
@endpoint(
    secrets=["HF_TOKEN"],
    on_start=load_models,
    name="meta-llama-3.1-8b-instruct",
    cpu=2,
    memory="16Gi",
    gpu="A10G",
    image=Image(python_version="python3.9")
    .add_python_packages(
        [
            "torch",
            "transformers",
            "accelerate",
            "huggingface_hub[hf-transfer]",
        ]
    )
    .with_envs({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "TOKENIZERS_PARALLELISM": "false",
        "CUDA_VISIBLE_DEVICES": "0",
    }),
    volumes=[
        Volume(
            name="cached_models",
            mount_path=BEAM_VOLUME_PATH,
        )
    ],
)
def generate_text(context, **inputs):
    # Retrieve model and tokenizer from on_start
    model, tokenizer = context.on_start_value

    # Inputs passed to API
    messages = inputs.pop("messages", None)
    if not messages:
        return {"error": "Please provide messages for text generation."}

    generate_args = {
        "max_new_tokens": inputs.get("max_tokens", MAX_LENGTH),
        "temperature": inputs.get("temperature", TEMPERATURE),
        "top_p": inputs.get("top_p", TOP_P),
        "top_k": inputs.get("top_k", TOP_K),
        "repetition_penalty": inputs.get("repetition_penalty", REPETITION_PENALTY),
        "no_repeat_ngram_size": inputs.get("no_repeat_ngram_size", NO_REPEAT_NGRAM_SIZE),
        "num_beams": inputs.get("num_beams", NUM_BEAMS),
        "early_stopping": inputs.get("early_stopping", EARLY_STOPPING),
        "do_sample": inputs.get("do_sample", DO_SAMPLE),
        "use_cache": True,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.pad_token_id,
    }

    model_inputs_str = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

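    # Tokenize inputs with truncation. The chat template already adds the
    # BOS token, so skip special tokens here to avoid a duplicate.
    tokenized_inputs = tokenizer(
        model_inputs_str,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048,
        add_special_tokens=False,
    )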
    input_ids = tokenized_inputs["input_ids"].to("cuda")
    attention_mask = tokenized_inputs["attention_mask"].to("cuda")
    input_ids_length = input_ids.shape[-1]

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids, attention_mask=attention_mask, **generate_args
        )

    # Keep only the newly generated tokens, dropping the prompt
    new_tokens = outputs[0][input_ids_length:]
    output_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    return {"output": output_text}
```
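
The endpoint above only checks that `messages` is non-empty. As a sketch of stricter input handling, the hypothetical `validate_messages` helper below checks that each message is a dictionary with a known `role` and a string `content` before it reaches the chat template. The accepted roles are an assumption (`system` and `user`, as in the request example in this guide, plus `assistant`); adjust them to your use case.

```python
from typing import Any, Optional

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed set of roles


def validate_messages(messages: Any) -> Optional[str]:
    """Return an error string if messages is malformed, else None."""
    if not isinstance(messages, list) or not messages:
        return "Please provide a non-empty list of messages."
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            return f"Message {i} must be an object with 'role' and 'content'."
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return f"Message {i} has an unsupported role."
        if not isinstance(message.get("content"), str):
            return f"Message {i} must have string content."
    return None
```

Inside `generate_text`, you could call this right after reading `messages` and return `{"error": ...}` whenever it reports a problem.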

## Deploy to Production

The following command deploys our code to Beam and hosts it as a REST API:

```sh
beam deploy app.py:generate_text
```

## Invoking the API

Once the API is running, you can invoke it using the following cURL command:

```sh
curl -X POST 'https://app.beam.cloud/endpoint/id/[ENDPOINT-ID]' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer [AUTH-TOKEN]' \
-d '{
    "messages": [
        {"role": "system", "content": "You are a Yoda chatbot who always responds in Yoda speak!"},
        {"role": "user", "content": "Who are you?"}
    ]
}'
```

Replace `[ENDPOINT-ID]` with your actual endpoint ID and `[AUTH-TOKEN]` with your authentication token. You'll see a response from the API, like this:

```json
{
  "output": "A Jedi I am. In the ways of the Force, trained I have been."
}
```
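
The same request can be made from Python. The sketch below uses only the standard library and mirrors the cURL command above. The `build_request` and `call_endpoint` helper names are illustrative, and the endpoint ID and auth token are placeholders you need to supply.

```python
import json
import urllib.request
from typing import Dict, List


def build_request(
    endpoint_id: str, auth_token: str, messages: List[Dict[str, str]]
) -> urllib.request.Request:
    """Build the POST request, mirroring the cURL example."""
    url = f"https://app.beam.cloud/endpoint/id/{endpoint_id}"
    body = json.dumps({"messages": messages}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {auth_token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call_endpoint(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call_endpoint(build_request("[ENDPOINT-ID]", "[AUTH-TOKEN]", messages))["output"]` would return the generated text once the placeholders are replaced with real values.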

## Summary

You've successfully set up a serverless API for generating text with the Meta Llama 3.1 8B Instruct model on Beam.
