The first thing we’ll do is set up an Image with the Python packages required for this app.
Because this script will run remotely, we need to make sure our local Python interpreter doesn’t try to import packages that are only installed in the remote environment.
We’ll use the env.is_remote() check to import these packages only when the script is running remotely on Beam.
app.py
from beam import Image, Volume, endpoint, Output, env

# This check ensures that the packages are only imported when running this script remotely on Beam
if env.is_remote():
    from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler
    import torch
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file
    import os
    import uuid

# The container image for the remote runtime
image = Image(
    python_version="python3.9",
    python_packages=[
        "diffusers[torch]>=0.10",
        "transformers",
        "huggingface_hub",
        "torch",
        "peft",
        "pillow",
        "accelerate",
        "safetensors",
        "xformers",
    ],
)
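The same check can guard any other code that should only run inside the container. Here is a minimal illustration of how env.is_remote() behaves; this snippet is just for demonstration and isn’t part of the app:

from beam import env

if env.is_remote():
    # True only inside the Beam container, where the packages above are installed
    import torch
else:
    # Locally this branch runs, so none of the GPU/ML packages need to be installed
    print("Running locally; skipping remote-only imports")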
Next, we’ll set up a function to run once when the container first starts up. This allows us to cache the model in memory between requests and ensures we don’t unnecessarily re-load the model.
app.py
CACHE_PATH = "./models"
MODEL_URL = "https://huggingface.co/martyn/sdxl-turbo-mario-merge-top-rated/blob/main/topRatedTurboxlLCM_v10.safetensors"
LORA_WEIGHT_NAME = "raw.safetensors"
LORA_REPO = "ntc-ai/SDXL-LoRA-slider.raw"

# This function runs once when the container first boots
def load_models():
    hf_hub_download(
        repo_id=LORA_REPO, filename=LORA_WEIGHT_NAME, cache_dir=CACHE_PATH
    )
    pipe = StableDiffusionXLPipeline.from_single_file(
        MODEL_URL,
        torch_dtype=torch.float16,
        safety_checker=None,
        cache_dir=CACHE_PATH,
    ).to("cuda")
    return pipe
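Because CACHE_PATH is backed by a Volume (mounted in the endpoint below), the files downloaded here persist across container restarts, so later cold starts skip the download. Since hf_hub_download returns the local path of the cached file, you can verify this yourself; this is a small optional sketch, not part of the app:

lora_path = hf_hub_download(
    repo_id=LORA_REPO, filename=LORA_WEIGHT_NAME, cache_dir=CACHE_PATH
)
print("LoRA cached at:", lora_path)                  # a path under ./models
print("Already on the volume:", os.path.exists(lora_path))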
Here’s our inference function. By adding the @endpoint() decorator, we can expose it as a RESTful API.
There are a few things to take note of:
an image with the Python requirements we defined above
an on_start function that runs once when the container first boots. The value returned from on_start (in this case, our pipe object) is available in the inference function via the context argument: pipe = context.on_start_value
volumes, which are used to store the downloaded LoRAs and model weights on Beam
keep_warm_seconds, which tells Beam how long to keep the container running between requests
app.py
@endpoint(
    image=image,
    on_start=load_models,
    keep_warm_seconds=60,
    cpu=2,
    memory="32Gi",
    gpu="A10G",
    volumes=[Volume(name="models", mount_path=CACHE_PATH)],
)
def generate(context, prompt="medieval rich kingpin sitting in a tavern, raw"):
    # Retrieve the pre-loaded model from on_start
    pipe = context.on_start_value

    pipe.enable_sequential_cpu_offload()
    pipe.enable_attention_slicing("max")
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

    # Use a unique adapter name for each request
    adapter_name = f"raw_{uuid.uuid4().hex}"

    # Load the LoRA weights
    pipe.load_lora_weights(
        LORA_REPO, weight_name=LORA_WEIGHT_NAME, adapter_name=adapter_name
    )

    # Activate the LoRA
    pipe.set_adapters([adapter_name], adapter_weights=[2.0])

    # Generate image
    image = pipe(
        prompt,
        negative_prompt="nsfw",
        width=512,
        height=512,
        guidance_scale=2,
        num_inference_steps=10,
    ).images[0]

    # Save image file
    output = Output.from_pil_image(image).save()

    # Retrieve a pre-signed URL for the output file
    url = output.public_url()

    return {"image": url}
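Once the endpoint is defined, you can deploy it with the Beam CLI (for example, beam deploy app.py:generate — the exact command may differ depending on your CLI version) and call it over HTTP. Below is a rough sketch of a client request; the endpoint URL and auth token are placeholders that the deploy output and your Beam dashboard provide:

import requests

response = requests.post(
    "<YOUR_ENDPOINT_URL>",  # printed when you deploy the endpoint
    headers={
        "Authorization": "Bearer <YOUR_AUTH_TOKEN>",
        "Content-Type": "application/json",
    },
    json={"prompt": "medieval rich kingpin sitting in a tavern, raw"},
)

# The response contains the pre-signed URL returned by generate()
print(response.json())  # e.g. {"image": "https://..."}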