This guide demonstrates how to deploy a Text-to-Speech (TTS) API using the Zonos model from Zyphra. The API converts input text into spoken audio, leveraging a pre-trained transformer model and speaker embeddings derived from an example audio file. We use Beam’s infrastructure for compute and file output handling.

View the Code

See the full code for this example on GitHub.

Setup

Environment Configuration

First, create a file named app.py:

from beam import Image, endpoint, Output, env

# Heavy dependencies are only imported inside the remote container image
if env.is_remote():
    import os
    import uuid

    import torchaudio
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict
    from zonos.utils import DEFAULT_DEVICE as device

# Custom image: CUDA base image for GPU support, espeak-ng for Zonos phonemization
image = (
    Image(
        base_image="nvidia/cuda:12.4.1-devel-ubuntu22.04",
        python_version="python3.11",
    )
    .add_commands(["apt update && apt install -y espeak-ng git"])
    .add_commands([
        "pip install -U uv",
        # Install Zonos from source
        "git clone https://github.com/Zyphra/Zonos.git /tmp/Zonos",
        "cd /tmp/Zonos && pip install setuptools wheel && pip install -e .",
    ])
)

@endpoint(
    name="zonos-tts",
    image=image,
    cpu=12,
    memory="32Gi",
    gpu="A100-40",
    timeout=-1  # disable the task timeout
)
def generate(**inputs):
    text = inputs.get("text")

    if not text:
        return {"error": "Please provide text to convert to speech"}

    # Run from the cloned repo so relative asset paths resolve
    os.chdir("/tmp/Zonos")

    # Load the pre-trained transformer model (cached after the first download)
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

    # Build a speaker embedding from the bundled reference clip
    wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    # Condition on the input text and speaker, then generate audio codes
    cond_dict = make_cond_dict(text=text, speaker=speaker, language="en-us")
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)

    # Decode the codes to a waveform and save it to disk
    file_name = f"/tmp/zonos_out_{uuid.uuid4()}.wav"
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(file_name, wavs[0], model.autoencoder.sampling_rate)

    # Upload the file to Beam and generate a shareable public URL
    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=1200000000)

    return {"output_url": public_url}

if __name__ == "__main__":
    generate()
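
Before deploying, you can optionally test the endpoint on a temporary URL with the Beam CLI's serve command, which hot-reloads your code as you edit it (this assumes the CLI is installed and authenticated):

beam serve app.py:generate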

Deployment

Run this command to deploy the endpoint:

beam deploy app.py:generate

The CLI builds the image, deploys the endpoint, and prints the invocation details, including the endpoint URL:

=> Building image
=> Syncing files
=> Deploying
=> Deployed 🎉
=> Invocation details
curl -X POST 'https://app.beam.cloud/endpoint/zonos-tts/v1' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer {YOUR_AUTH_TOKEN}' \
-d '{"text": "On Beam run AI workloads anywhere with zero complexity."}'

API Usage

The deployed endpoint accepts POST requests with a JSON payload containing the text to convert to speech.

Request Format

{
  "text": "Your text to convert to speech"
}

Example Request

curl -X POST 'https://app.beam.cloud/endpoint/zonos-tts/v1' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer {YOUR_AUTH_TOKEN}' \
-d '{"text": "On Beam run AI workloads anywhere with zero complexity. One line of Python, global GPUs, full control"}'

Example Response

The API returns a JSON object with a URL to the generated audio file:

{
  "output_url": "https://app.beam.cloud/output/id/704defd0-9370-4499-9124-677925e64961"
}
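
Python Client Example

The endpoint can also be called programmatically. Below is a minimal sketch using the requests library: the endpoint URL and the YOUR_AUTH_TOKEN placeholder match the curl examples above, and the local output filename is arbitrary.

import requests

# Substitute your own deployment URL and Beam auth token
url = "https://app.beam.cloud/endpoint/zonos-tts/v1"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
}

# Request speech synthesis (generation can take a while on a cold start)
resp = requests.post(
    url,
    headers=headers,
    json={"text": "On Beam run AI workloads anywhere with zero complexity."},
    timeout=600,
)
resp.raise_for_status()
output_url = resp.json()["output_url"]

# Download the generated audio from the returned public URL
audio = requests.get(output_url, timeout=60)
audio.raise_for_status()
with open("zonos_output.wav", "wb") as f:
    f.write(audio.content)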