You’ll start by adding an endpoint decorator with an Image to your code.

The endpoint is the wrapper for your inference function. Inside the endpoint is an Image, which defines the container image your code will run on.
If you’d like to make further customizations to your image — such as adding
shell commands — you can do so using the commands argument. Read more
about custom images.
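As a rough sketch, and assuming the same Image arguments used later in this guide, adding shell commands might look like this (the ffmpeg install is only an illustrative placeholder):

```python
from beam import Image

# Illustrative sketch: the commands argument runs extra shell commands
# while the image is built. The ffmpeg install below is just a placeholder.
image = Image(
    python_version="python3.9",
    python_packages=["transformers", "torch"],
    commands=["apt-get update && apt-get install -y ffmpeg"],
)
```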
Typically, the apps you run on Beam will use packages that you don’t have installed locally. Some of these Python packages, like Transformers, aren’t installed on your local machine, so we’ll use env.is_remote() to conditionally import them only when the code is running in the remote cloud environment.
```python
from beam import env

if env.is_remote():
    import transformers
    import torch
```
This check determines whether the Python script is running remotely on Beam, and only imports the packages in its scope when it is.
We’ll create a new function to run inference on facebook/opt-125m via Hugging Face Transformers. Since we’ll deploy this as a REST API, we add an @endpoint decorator above the inference function:
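Here’s a rough sketch of what that can look like, reusing the endpoint name and Image settings from later in this guide; the prompt and generation settings are only illustrative placeholders:

```python
from beam import Image, endpoint, env

# Only import heavy ML packages inside the remote container
if env.is_remote():
    from transformers import AutoTokenizer, OPTForCausalLM


@endpoint(
    name="inference-quickstart",
    image=Image(
        python_version="python3.9",
        python_packages=["transformers", "torch"],
    ),
)
def predict():
    # Load the model and tokenizer from Hugging Face
    model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

    # Placeholder prompt and generation settings, for illustration only
    inputs = tokenizer("The meaning of life is", return_tensors="pt")
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=20)
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```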
Beam includes a live-reloading feature that allows you to run your code in the same environment you’ll be running in production.
By default, Beam will sync all the files in your working directory to the
remote container. This allows you to use the files you have locally while
developing. If you want to prevent some files from getting uploaded, you can
create a .beamignore.
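For example, assuming .beamignore follows the familiar .gitignore-style patterns, it might look something like this (the entries are hypothetical):

```
.venv/
__pycache__/
*.ckpt
data/
```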
In your shell, run beam serve app.py:predict. This will:
Spin up a container
Run it on a GPU
Print a cURL request to invoke the API
Stream the logs to your shell
You should keep this terminal window open while developing.
Now, head back to your IDE and change a line of code. Hit save.

If you look closely at the shell running beam serve, you’ll notice the server reloading with your code changes.

You’ll use this workflow anytime you’re developing an app on Beam. Trust us: it makes the development process uniquely fast and painless.
If you called the API via the cURL command, you’ll notice that your model was downloaded each time you invoked the API.

To improve performance, we’ll set up a function to pre-load your models and keep them available between API calls.
Beam includes an on_start method, which you can pass to your function decorators. on_start runs exactly once when the container first starts.

The return value of the on_start function can be retrieved from context.on_start_value:
```python
from beam import Image, endpoint, env

if env.is_remote():
    from transformers import AutoTokenizer, OPTForCausalLM
    import torch


def download_models():
    from transformers import AutoTokenizer, OPTForCausalLM

    model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
    return model, tokenizer


@endpoint(
    name="inference-quickstart",
    on_start=download_models,
    image=Image(
        python_version="python3.9",
        python_packages=[
            "transformers",
            "torch",
        ],
    ),
)
def predict(context):
    # Retrieve cached model from on_start function
    model, tokenizer = context.on_start_value

    # Do something with the model and tokenizer...
```
The on_start method saves us from having to load the model on every request, but we can avoid downloading it repeatedly by caching it in a Storage Volume.

Beam allows you to create highly available storage volumes that can be used across tasks. You might use volumes for things like storing model weights or large datasets:
```python
from beam import Image, endpoint, Volume

# Model weights will be cached in this folder
CACHE_PATH = "./weights"


# This function runs once when the container first starts
def download_models():
    from transformers import AutoTokenizer, OPTForCausalLM

    model = OPTForCausalLM.from_pretrained("facebook/opt-125m", cache_dir=CACHE_PATH)
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", cache_dir=CACHE_PATH)
    return model, tokenizer


@endpoint(
    name="inference-quickstart",
    on_start=download_models,
    volumes=[Volume(name="weights", mount_path=CACHE_PATH)],
    cpu=1,
    memory="16Gi",
    gpu="T4",
    image=Image(
        python_version="python3.9",
        python_packages=[
            "transformers",
            "torch",
        ],
    ),
)
def predict(context):
    # Retrieve cached model from on_start function
    model, tokenizer = context.on_start_value
    # Do something with the model and tokenizer...
```
Now, these models can be automatically downloaded to the volume by using the cache_dir argument in transformers:
```python
model = OPTForCausalLM.from_pretrained("facebook/opt-125m", cache_dir=CACHE_PATH)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", cache_dir=CACHE_PATH)
```
These volumes are mounted directly to the container running your app, so you can read from and write to them like any normal directory on disk.
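As an illustration, here’s a hypothetical snippet that writes to and reads from the mounted path with ordinary file I/O (the file name is made up for this example):

```python
import os

# Matches the mount_path of the volume defined above
CACHE_PATH = "./weights"

# The mounted volume behaves like a normal directory, so standard file I/O works.
os.makedirs(CACHE_PATH, exist_ok=True)
note_path = os.path.join(CACHE_PATH, "last_run.txt")

with open(note_path, "w") as f:
    f.write("model weights are cached in this folder")

with open(note_path) as f:
    print(f.read())
```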
With these performance optimizations in place, it’s time to deploy your API to create a persistent endpoint. In your shell, run this command to deploy your app:
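Assuming the same file and function names as the beam serve example above, the command would look like beam deploy app.py:predict.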