This guide demonstrates how to run the Mochi-1 text-to-video model on Beam. Mochi-1 is a powerful model for generating high-quality videos based on text prompts.

View the Code

See the code for this example on GitHub.

Introduction

Mochi-1, developed by Genmo, is a state-of-the-art text-to-video model. This guide walks through deploying it and serving it as a serverless API on Beam.

Upload Model Weights

Before running inference, you need to upload the model weights to a Beam volume. This is handled by the upload.py script:

from beam import function, Volume, Image, env

if env.is_remote():
    from huggingface_hub import snapshot_download

VOLUME_PATH = "./mochi-1-preview"

@function(
    image=Image(
        python_packages=["huggingface_hub", "huggingface_hub[hf-transfer]"]
    ).with_envs("HF_HUB_ENABLE_HF_TRANSFER=1"),
    memory="32Gi",
    cpu=4,
    secrets=["HF_TOKEN"],
    volumes=[Volume(name="mochi-1-preview", mount_path=VOLUME_PATH)],
)
def upload():
    # Download the weights from Hugging Face directly into the mounted volume
    snapshot_download(
        repo_id="genmo/mochi-1-preview",
        local_dir=f"{VOLUME_PATH}/weights"
    )

    print("Files uploaded successfully")

if __name__ == "__main__":
    upload()

Steps to Run the Script

Run the script locally to upload the weights:

python upload.py

Once the weights are uploaded, the generate_video endpoint can access them for inference.
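If you want to confirm that the weights landed where the endpoint expects them, you can list the volume contents from a small helper function. This is a minimal sketch; the list_weights function is not part of the example code, and it assumes the default function resources are enough for a directory listing:

from beam import function, Volume

VOLUME_PATH = "./mochi-1-preview"

@function(volumes=[Volume(name="mochi-1-preview", mount_path=VOLUME_PATH)])
def list_weights():
    import os

    # Print the files stored under the weights directory on the volume
    for name in sorted(os.listdir(f"{VOLUME_PATH}/weights")):
        print(name)

if __name__ == "__main__":
    list_weights()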

Set Up the Remote Environment

In app.py, the remote-only imports are guarded by env.is_remote(), and the load_models function loads the Mochi-1 pipeline from the volume:

from beam import endpoint, env, Volume, Image, Output

VOLUME_PATH = "./mochi-1-preview"

if env.is_remote():
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video
    import uuid
 
# Runs once when the container starts; the returned pipeline is available
# to the endpoint via context.on_start_value
def load_models():
    pipe = MochiPipeline.from_pretrained(
        f"{VOLUME_PATH}/weights", variant="bf16", torch_dtype=torch.bfloat16)
    return pipe

The mochi_image defines the Python packages and system dependencies the endpoint needs. diffusers is installed directly from GitHub so the image picks up the latest MochiPipeline support:

mochi_image = (
    Image(
        python_version="python3.11",
        python_packages=["torch", "transformers", "accelerate",
                         "sentencepiece", "imageio-ffmpeg", "imageio", "ninja"]
    )
    .add_commands([
        "apt update && apt install git -y",
        "pip install git+https://github.com/huggingface/diffusers.git",
    ])
)

Inference Function

The generate_video function processes text prompts and generates a video:

@endpoint(
    name="mochi-1-preview",
    on_start=load_models,
    cpu=4,
    memory="32Gi",
    gpu="A10G",
    gpu_count=2,
    image=mochi_image,
    volumes=[Volume(name="mochi-1-preview", mount_path=VOLUME_PATH)],
    timeout=-1
)
def generate_video(context, **inputs):
    # Retrieve the pipeline loaded in load_models
    pipe = context.on_start_value

    prompt = inputs.pop("prompt", None)

    if not prompt:
        return {"error": "Please provide a prompt"}

    # Reduce GPU memory usage during inference
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    # Generate 40 frames from the prompt
    frames = pipe(prompt, num_frames=40).frames[0]

    file_name = f"/tmp/mochi_out_{uuid.uuid4()}.mp4"
    
    export_to_video(frames, file_name, fps=15)

    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=-1)
    print(public_url)
    return {"output_url": public_url}

Deployment

Deploy the API to Beam:

beam deploy app.py:generate_video

Invoking the API

To invoke the API, send a POST request with the following payload:

{
    "prompt": "The camera follows behind a rugged green Jeep with a black snorkel as it speeds along a narrow dirt trail cutting through a dense jungle. Thick vines hang from towering trees with sprawling canopies, their leaves forming a vibrant green tunnel above the vehicle. Mud splashes up from the Jeep’s tires as it powers through a shallow stream crossing the path. Sunlight filters through gaps in the trees, casting dappled golden light over the scene. The dirt trail twists sharply into the distance, overgrown with wild ferns and tropical plants. The vehicle is seen from the rear, leaning into the curve as it maneuvers through the untamed terrain, emphasizing the adventure of the rugged journey. The surrounding jungle is alive with texture and color, with distant mountains barely visible through the mist and an overcast sky heavy with the promise of rain."
}

Here’s an example of a cURL request:

curl -X POST 'https://app.beam.cloud/endpoint/id/[ENDPOINT-ID]' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer [AUTH-TOKEN]' \
-d '{
    "prompt": "Your text prompt for video generation."
}'
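You can also invoke the endpoint from Python. Here is a minimal sketch using the requests library; the endpoint ID and auth token are placeholders, and the long timeout accounts for video generation taking several minutes:

import requests

url = "https://app.beam.cloud/endpoint/id/[ENDPOINT-ID]"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer [AUTH-TOKEN]",
}
payload = {"prompt": "Your text prompt for video generation."}

# The request blocks until the video has been generated
response = requests.post(url, json=payload, headers=headers, timeout=3600)
response.raise_for_status()
print(response.json()["output_url"])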

Example Output

The API returns a URL to the generated video. Here's an example response:

{
    "output_url": "https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825"
}
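The output_url can be fetched like any other file URL. A minimal sketch using requests, assuming the public URL serves the MP4 directly:

import requests

output_url = "https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825"

# Stream the generated video to a local file
with requests.get(output_url, stream=True, timeout=300) as r:
    r.raise_for_status()
    with open("mochi_out.mp4", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)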

Example Video

Here is an example video generated by the Mochi-1 model.

Summary

You’ve successfully deployed and tested a Mochi-1 text-to-video generation API using Beam.