This guide demonstrates how to deploy a Text-to-Speech (TTS) API using the Zonos model from Zyphra. The API converts input text into spoken audio, leveraging a pre-trained transformer model and speaker embeddings derived from an example audio file. We use Beam’s infrastructure for compute and file output handling.
View the full code for this example on GitHub.
Environment Configuration
First, create a file named app.py:
from beam import Image, endpoint, Output, env

# Heavy imports only run inside the remote container, not on the client machine
if env.is_remote():
    import os
    import uuid

    import torchaudio
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict
    from zonos.utils import DEFAULT_DEVICE as device

# Custom image: CUDA base, espeak-ng for phonemization, and Zonos installed from source
image = (
    Image(
        base_image="nvidia/cuda:12.4.1-devel-ubuntu22.04",
        python_version="python3.11",
    )
    .add_commands(["apt update && apt install -y espeak-ng git"])
    .add_commands(
        [
            "pip install -U uv",
            "git clone https://github.com/Zyphra/Zonos.git /tmp/Zonos",
            "cd /tmp/Zonos && pip install setuptools wheel && pip install -e .",
        ]
    )
)


@endpoint(
    name="zonos-tts",
    image=image,
    cpu=12,
    memory="32Gi",
    gpu="A100-40",
    timeout=-1,  # disable the request timeout
)
def generate(**inputs):
    text = inputs.get("text")
    if not text:
        return {"error": "Please provide a text"}

    # Run from the cloned repo so relative asset paths resolve
    os.chdir("/tmp/Zonos")

    # Load the pretrained transformer model
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

    # Build a speaker embedding from the bundled example audio
    wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    # Condition the model on the text and speaker, then generate audio codes
    cond_dict = make_cond_dict(text=text, speaker=speaker, language="en-us")
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)

    # Decode the codes to a waveform and save it as a WAV file
    file_name = f"/tmp/zonos_out_{uuid.uuid4()}.wav"
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(file_name, wavs[0], model.autoencoder.sampling_rate)

    # Upload the file and return a public URL
    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=1200000000)
    return {"output_url": public_url}


if __name__ == "__main__":
    generate()
Deployment
Run this command to deploy the endpoint:
beam deploy app.py:generate
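If you want to iterate on the endpoint before deploying it, Beam's CLI also offers a temporary dev server. This is a hedged suggestion, not part of the example above; check beam --help in your installed CLI version:

beam serve app.py:generate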
After the build completes, the CLI prints the endpoint URL along with invocation details:
=> Building image
=> Syncing files
=> Deploying
=> Deployed 🎉
=> Invocation details
curl -X POST 'https://app.beam.cloud/endpoint/zonos-tts/v1' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer {YOUR_AUTH_TOKEN}' \
-d '{"text": "On Beam run AI workloads anywhere with zero complexity."}'
API Usage
The deployed endpoint accepts POST requests with a JSON payload containing the text to convert to speech.
{
  "text": "Your text to convert to speech"
}
Example Request
curl -X POST 'https://app.beam.cloud/endpoint/zonos-tts/v1' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer {YOUR_AUTH_TOKEN}' \
-d '{"text": "On Beam run AI workloads anywhere with zero complexity. One line of Python, global GPUs, full control"}'
Example Response
The API returns a JSON object with a URL to the generated audio file:
{
  "output_url": "https://app.beam.cloud/output/id/704defd0-9370-4499-9124-677925e64961"
}
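For programmatic use, here is a minimal Python client sketch. It is not part of the example code above: it assumes the requests library is installed, and the ENDPOINT_URL and AUTH_TOKEN placeholders stand in for the values printed in the invocation details at deploy time.

import requests

# Placeholders: substitute the URL and auth token printed by `beam deploy`
ENDPOINT_URL = "https://app.beam.cloud/endpoint/zonos-tts/v1"
AUTH_TOKEN = "{YOUR_AUTH_TOKEN}"

# Send the text to the endpoint
resp = requests.post(
    ENDPOINT_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {AUTH_TOKEN}",
    },
    json={"text": "On Beam run AI workloads anywhere with zero complexity."},
)
resp.raise_for_status()
output_url = resp.json()["output_url"]

# Download the generated audio to a local WAV file
audio = requests.get(output_url)
audio.raise_for_status()
with open("zonos_out.wav", "wb") as f:
    f.write(audio.content)

Note that the lifetime of the returned URL is controlled by the expires argument passed to public_url in the handler.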