Pygmalion 6B
This example shows how you can deploy a serverless inference API for Pygmalion 6B.
What is Pygmalion?
You can use Pygmalion as a general text generation model, but it's designed for dialogue. It performs best when prompts are passed in using the following format:
[CHARACTER]'s Persona: [A few sentences about the character you want the model to play]
<START>
[DIALOGUE HISTORY]
You: [Your input message here]
[CHARACTER]:
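For example, here's a minimal sketch of how you might assemble a prompt in this format in Python. The character name, persona, and messages below are placeholders, not part of the API:
character = "Aria"  # hypothetical character name
persona = "Aria is a cheerful botanist who loves talking about rare plants."
dialogue_history = [
    "You: Hi Aria, what are you working on?",
    "Aria: I'm repotting a fern I found on my last field trip!",
]
user_message = "That sounds fun. Can I help?"

# Assemble the prompt in the format Pygmalion expects
prompt = "\n".join(
    [
        f"{character}'s Persona: {persona}",
        "<START>",
        *dialogue_history,
        f"You: {user_message}",
        f"{character}:",
    ]
)
print(prompt)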
Running Pygmalion 6B on a remote GPU
First, you’ll set up your compute environment. You’ll specify:
- Compute requirements
- Python packages to install in the runtime
This example runs on an A10G with 24Gi of GPU VRAM.
from beam import App, Runtime, Image, Volume, RequestLatencyAutoscaler
CACHE_PATH = "./cached_models"
app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=2,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_version="python3.9",
            python_packages=[
                "transformers",
                "accelerate",
                "torch",
                "bitsandbytes",
                "scipy",
                "protobuf",
            ],  # You can also add a path to a requirements.txt instead
        ),
    ),
    # Storage volume for cached models
    volumes=[Volume(name="cached_models", path=CACHE_PATH)],
)
How much VRAM does Pygmalion 6B use?
Pygmalion uses around 12Gi of GPU VRAM and 3Gi of RAM.
After you’ve deployed your app, you can monitor the compute resources used in the web dashboard. You can use this info to calibrate the amount of resources to request in your app.
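If you'd rather check memory usage programmatically, a minimal sketch (assuming the model is already loaded onto a CUDA device, as in the loader below) is to query PyTorch directly:
import torch

# Rough check of how much GPU memory the loaded model is using
if torch.cuda.is_available():
    allocated_gib = torch.cuda.memory_allocated() / (1024**3)
    print(f"GPU memory allocated: {allocated_gib:.1f} GiB")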

Inference API
This app includes a few things:
- A Volume to store the cached models
- A loader function to pre-load models onto disk when the container first starts
- A RequestLatencyAutoscaler to scale up additional servers when the API is under load
- A REST API trigger to deploy the predict function as a serverless REST API
from beam import App, Runtime, Image, Volume, RequestLatencyAutoscaler
from transformers import AutoTokenizer, AutoModelForCausalLM

# Beam volume to store cached models
CACHE_PATH = "./cached_models"

app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=2,
        memory="32Gi",
        gpu="A10G",
        image=Image(
            python_version="python3.9",
            python_packages=[
                "transformers",
                "accelerate",
                "torch",
                "bitsandbytes",
                "scipy",
                "protobuf",
            ],  # You can also add a path to a requirements.txt instead
        ),
    ),
    # Storage volume for cached models
    volumes=[Volume(name="cached_models", path=CACHE_PATH)],
)


# Pre-load models: this function runs once when the container boots
def load_models():
    tokenizer = AutoTokenizer.from_pretrained(
        "PygmalionAI/pygmalion-6b", cache_dir=CACHE_PATH
    )
    model = AutoModelForCausalLM.from_pretrained(
        "PygmalionAI/pygmalion-6b",
        load_in_8bit=True,
        device_map="auto",
        cache_dir=CACHE_PATH,
    )
    return model, tokenizer


# Autoscale by request latency: spin up additional replicas if latency exceeds 30s
autoscaler = RequestLatencyAutoscaler(desired_latency=30, max_replicas=5)


# REST API initialized with the loader and autoscaler
@app.rest_api(loader=load_models, autoscaler=autoscaler)
def predict(**inputs):
    # Retrieve the cached model and tokenizer returned by the loader
    model, tokenizer = inputs["context"]
    # Input from the API request
    prompt = inputs["prompt"]

    # Inference
    tokenized = tokenizer(prompt, return_tensors="pt")
    generate_ids = model.generate(tokenized.input_ids.to("cuda"), max_length=30)
    result = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    print(result)
    return {"prediction": result}
Running Inference
You can run this example with your own custom prompt by using the beam serve
command:
beam serve app.py
You’ll see terminal logs similar to those below:
(.venv) beta9@MacBook-Air-3 pygmalian % beam serve app.py
i Using cached image.
✓ App initialized.
i Uploading files...
✓ Container scheduled, logs will appear below.
⠧ Starting container... 5s (Estimated: 3m20s)
================= Call the API =================
curl -X POST 'https://apps.beam.cloud/serve/5w6mr/652f4476c4545d0009885dfa' \
-H 'Accept: */*' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Authorization: Basic [AUTH TOKEN]' \
-H 'Content-Type: application/json' \
-d '{}'
============= Logs Streamed Below ==============
INFO: | Starting deployment.
INFO: | Starting task worker[0]
INFO: | Starting app...
INFO: | Loading handler in 'app.py:predict'...
INFO: | Running loader in 'app.py:load_models'...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|##### | 1/2 [00:46<00:46, 46.20s/it]
Loading checkpoint shards: 100%|##########| 2/2 [01:06<00:00, 33.23s/it]
INFO: | Ready for tasks.
You can copy the cURL command to invoke the API, using your own input payload:
curl -X POST 'https://apps.beam.cloud/serve/5w6mr/652f4476c4545d0009885dfa' \
-H 'Accept: */*' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Connection: keep-alive' \
-H 'Authorization: Basic [AUTH TOKEN]' \
-H 'Content-Type: application/json' \
-d '{"prompt": "YOUR PROMPT"}'
Deployment
You might want to deploy this as a REST API. To do so, just run beam deploy:
beam deploy app.py
Your Beam Dashboard will open in a browser window, and you can monitor the deployment status in the web UI.
Calling the API
You’ll call the API by pasting in the cURL command displayed in the browser window.
curl -X POST --compressed "https://apps.beam.cloud/cjm9u" \
-H 'Accept: */*' \
-H 'Accept-Encoding: gzip, deflate' \
-H 'Authorization: Basic [ADD_YOUR_AUTH_TOKEN]' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-d '{"prompt": "me"}'
The API will return a dialogue, like this:
{ "prediction": "Me: I'm sorry, I didn't mean to hurt you. I was just trying to get your attention. You: It's okay"}