Pygmalion 6B
This example shows how you can deploy a serverless inference API for Pygmalion 6B.
What is Pygmalion?
You can use Pygmalion as a general text generation model, but it’s designed for dialogue. It performs best when prompts are passed in using the following format:
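The template below follows the structure described on the Pygmalion 6B model card; the bracketed pieces are placeholders you fill in with your own character and conversation:

```
[CHARACTER]'s Persona: [A few sentences about the character you want the model to play]
<START>
[DIALOGUE HISTORY]
You: [User's input message]
[CHARACTER]:
```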
Running Pygmalion 6B on a remote GPU
First, you’ll set up your compute environment. You’ll specify:
- Compute requirements
- Python packages to install in the runtime
This example runs on an A10G with 24Gi of GPU VRAM.
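As a sketch, the environment definition might look like this, assuming the Beam Python SDK’s `App`/`Runtime`/`Image` pattern; the app name, CPU count, and memory request here are illustrative choices, not requirements:

```python
from beam import App, Image, Runtime

# Illustrative compute spec: an A10G provides 24Gi of GPU VRAM,
# comfortably above the ~12Gi that Pygmalion 6B needs
app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=4,
        memory="16Gi",
        gpu="A10G",
        image=Image(
            python_packages=["transformers", "torch", "accelerate"],
        ),
    ),
)
```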
How much VRAM does Pygmalion 6B use?
Pygmalion uses around 12Gi of GPU VRAM and 3Gi of RAM.
After you’ve deployed your app, you can monitor its compute usage in the web dashboard and use that information to calibrate how many resources to request.
Inference API
This app includes a few things:
- `Volume` to store the cached models
- `Loader` function to pre-load models onto disk when the container first starts
- `REST API` trigger to deploy the `predict` function as a serverless REST API
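Here’s a minimal sketch of how these pieces fit together, assuming the Beam SDK’s `Volume` attachment and its `rest_api` decorator with a `loader` argument (the loader’s return value arriving under the `context` key). The cache path, generation settings, and `text` response key are illustrative choices; the model is loaded in fp16 via Hugging Face `transformers`:

```python
import torch
from beam import App, Image, Runtime, Volume
from transformers import AutoModelForCausalLM, AutoTokenizer

CACHE_PATH = "./cached_models"  # illustrative mount path for the volume

app = App(
    name="pygmalion",
    runtime=Runtime(
        cpu=4,
        memory="16Gi",
        gpu="A10G",
        image=Image(python_packages=["transformers", "torch", "accelerate"]),
    ),
    # Persistent volume: downloaded weights survive container restarts
    volumes=[Volume(name="cached_models", path=CACHE_PATH)],
)


def load_models():
    # Runs when the container first starts; weights are cached on the volume
    tokenizer = AutoTokenizer.from_pretrained(
        "PygmalionAI/pygmalion-6b", cache_dir=CACHE_PATH
    )
    model = AutoModelForCausalLM.from_pretrained(
        "PygmalionAI/pygmalion-6b",
        cache_dir=CACHE_PATH,
        torch_dtype=torch.float16,  # fp16 weights use ~12Gi of VRAM, as noted above
    ).to("cuda")
    return model, tokenizer


@app.rest_api(loader=load_models)
def predict(**inputs):
    # The loader's return value is injected under the "context" key
    model, tokenizer = inputs["context"]
    prompt = inputs["prompt"]

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(
        input_ids, max_new_tokens=128, do_sample=True, temperature=0.8
    )
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}
```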
Running Inference
You can run this example with your own custom prompt by using the `beam serve` command.
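Assuming the app above is saved in a file named `app.py` (the filename is an assumption), the invocation looks like:

```sh
beam serve app.py
```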
The terminal logs will include a cURL command that you can copy to invoke the API with your own input payload.
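The endpoint URL, app ID, and auth token below are placeholders; copy the real values from your terminal output. The shape of the request might look like:

```sh
curl -X POST 'https://apps.beam.cloud/<APP_ID>' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Basic <AUTH_TOKEN>' \
  -d '{"prompt": "You: Hi, how are you?\nAqua:"}'
```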
Deployment
You might want to deploy this as a persistent REST API. To do so, just run `beam deploy`.
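As with `beam serve`, point the CLI at your app file (again assuming it’s named `app.py`):

```sh
beam deploy app.py
```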
Your Beam Dashboard will open in a browser window, and you can monitor the deployment status in the web UI.
Calling the API
You’ll call the API by pasting in the cURL command displayed in the browser window.
The API will return a dialogue.
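With the sketch above, the response body is whatever `predict` returns; the dialogue below is a made-up illustration of its shape, and the generated text will vary from run to run:

```json
{
  "text": "You: Hi, how are you?\nAqua: I'm doing wonderfully, thank you! What brings you here today?"
}
```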