This example shows how you can deploy a serverless inference API for Pygmalion 6B.

What is Pygmalion?

You can use Pygmalion as a general-purpose text generation model, but it's designed for dialogue. It performs best when prompts are passed in using the following format:

[CHARACTER]'s Persona: [A few sentences about the character you want the model to play]
<START>
[DIALOGUE HISTORY]
You: [Your input message here]
[CHARACTER]:
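
For example, you could assemble a prompt in this format with a small helper. This is a minimal sketch; the build_prompt function and the character details below are illustrative, not part of the Pygmalion model or the Beam SDK:

def build_prompt(character, persona, history, user_message):
    # Assemble a Pygmalion-style prompt from the pieces described above
    lines = [f"{character}'s Persona: {persona}", "<START>"]
    lines += history  # previous "You: ..." / "[CHARACTER]: ..." turns
    lines.append(f"You: {user_message}")
    lines.append(f"{character}:")
    return "\n".join(lines)


prompt = build_prompt(
    character="Aria",
    persona="A cheerful travel guide who loves trivia about obscure cities.",
    history=[],
    user_message="Where should I go this summer?",
)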

Running Pygmalion 6B on a remote GPU

First, you’ll set up your compute environment. You’ll specify:

  • Compute requirements
  • Python packages to install in the runtime

This example runs on an A10G with 24Gi of GPU VRAM.

app.py
from beam import App, Runtime, Image, Volume, RequestLatencyAutoscaler

CACHE_PATH = "./cached_models"

app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=2,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_version="python3.9",
            python_packages=[
                "transformers",
                "accelerate",
                "torch",
                "bitsandbytes",
                "scipy",
                "protobuf",
            ],  # You can also add a path to a requirements.txt instead
        ),
    ),
    # Storage volume for cached models
    volumes=[Volume(name="cached_models", path=CACHE_PATH)],
)

How much VRAM does Pygmalion 6B use?

Pygmalion uses around 12Gi of GPU VRAM and 3Gi of RAM.

After you’ve deployed your app, you can monitor the compute resources used in the web dashboard. You can use this info to calibrate the amount of resources to request in your app.
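
For instance, if the dashboard confirms the ~12Gi of VRAM and ~3Gi of RAM figures above, you could experiment with a leaner resource request. The values below are illustrative assumptions, not a recommendation from this example; leave headroom for spikes:

from beam import Runtime, Image

# Illustrative only: a smaller resource request informed by observed usage
runtime = Runtime(
    cpu=2,
    memory="16Gi",  # observed ~3Gi of RAM in use, so 32Gi may be more than you need
    gpu="A10G",  # 24Gi of VRAM comfortably fits the ~12Gi the model uses
    image=Image(
        python_version="python3.9",
        python_packages=["transformers", "torch"],  # abbreviated package list for illustration
    ),
)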

Inference API

This app includes a few things:

  • Volume to store the cached models
  • Loader function to pre-load models onto disk when the container first starts
  • RequestLatencyAutoscaler to scale up additional servers when the API is under load
  • REST API trigger to deploy the predict function as a serverless REST API
app.py
from beam import App, Runtime, Image, Volume, RequestLatencyAutoscaler
from transformers import AutoTokenizer, AutoModelForCausalLM

# Beam volume to store cached models
CACHE_PATH = "./cached_models"

app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=2,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_version="python3.9",
            python_packages=[
                "transformers",
                "accelerate",
                "torch",
                "bitsandbytes",
                "scipy",
                "protobuf",
            ],  # You can also add a path to a requirements.txt instead
        ),
    ),
    # Storage volume for cached models
    volumes=[Volume(name="cached_models", path=CACHE_PATH)],
)


# Pre-load models: this function runs once when the container boots
def load_models():
    tokenizer = AutoTokenizer.from_pretrained(
        "PygmalionAI/pygmalion-6b", cache_dir=CACHE_PATH
    )
    model = AutoModelForCausalLM.from_pretrained(
        "PygmalionAI/pygmalion-6b",
        load_in_8bit=True,
        device_map="auto",
        cache_dir=CACHE_PATH,
    )
    return model, tokenizer


# Autoscale by request latency: spin up additional replicas if latency exceeds 30s
autoscaler = RequestLatencyAutoscaler(desired_latency=30, max_replicas=5)


# REST API initialized with the loader and autoscaler
@app.rest_api(loader=load_models, autoscaler=autoscaler)
def predict(**inputs):
    # Retrieve cached model from loader
    model, tokenizer = inputs["context"]
    # Input from API request
    prompt = inputs["prompt"]

    # Inference - tokenize the prompt (use a new name to avoid shadowing the API `inputs` dict)
    tokenized = tokenizer(prompt, return_tensors="pt")
    generate_ids = model.generate(tokenized.input_ids.to("cuda"), max_length=30)
    result = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    print(result)

    return {"prediction": result}

Running Inference

You can run this example with your own custom prompt by using the beam serve command:

beam serve app.py

You’ll see terminal logs similar to those below:

(.venv) beta9@MacBook-Air-3 pygmalian % beam serve app.py
 i  Using cached image.
 ✓  App initialized.
 i  Uploading files...
 ✓  Container scheduled, logs will appear below.
⠧ Starting container... 5s (Estimated: 3m20s)

================= Call the API =================

curl -X POST 'https://apps.beam.cloud/serve/5w6mr/652f4476c4545d0009885dfa' \
-H 'Accept: */*' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Authorization: Basic [AUTH TOKEN]' \
-H 'Content-Type: application/json' \
-d '{}'

============= Logs Streamed Below ==============

INFO:     | Starting deployment.
INFO:     | Starting task worker[0]
INFO:     | Starting app...
INFO:     | Loading handler in 'app.py:predict'...
INFO:     | Running loader in 'app.py:load_models'...

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|#####     | 1/2 [00:46<00:46, 46.20s/it]
Loading checkpoint shards: 100%|##########| 2/2 [01:06<00:00, 33.23s/it]
INFO:     | Ready for tasks.

You can copy the cURL command to invoke the API, using your own input payload:

curl -X POST 'https://apps.beam.cloud/serve/5w6mr/652f4476c4545d0009885dfa' \
-H 'Accept: */*' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Authorization: Basic [AUTH TOKEN]' \
-H 'Content-Type: application/json' \
-d '{"prompt": "YOUR PROMPT"}'

Deployment

To deploy this as a persistent REST API, run beam deploy:

beam deploy app.py

Your Beam Dashboard will open in a browser window, and you can monitor the deployment status in the web UI.

Calling the API

You’ll call the API by pasting in the cURL command displayed in the browser window.

 curl -X POST --compressed "https://apps.beam.cloud/cjm9u" \
   -H 'Accept: */*' \
   -H 'Accept-Encoding: gzip, deflate' \
   -H 'Authorization: Basic [ADD_YOUR_AUTH_TOKEN]' \
   -H 'Connection: keep-alive' \
   -H 'Content-Type: application/json' \
   -d '{"prompt": "me"}'

The API will return a dialogue response, like this:

{ "prediction": "Me: I'm sorry, I didn't mean to hurt you. I was just trying to get your attention. You: It's okay"}