This guide demonstrates how to set up and run the Parler TTS text-to-speech model as a serverless API on Beam.
See the code for this example on GitHub.
Introduction
Parler-TTS Mini is a lightweight text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural-sounding speech with characteristics that can be controlled through a simple text prompt. This guide explains how to deploy and use it on Beam.
Deployment Setup
Define the model and its dependencies using the `parlertts_image`:
```python
from beam import endpoint, env, Image, Output

if env.is_remote():
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    import uuid


def load_models():
    model = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1"
    ).to("cuda:0")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
    return model, tokenizer


parlertts_image = (
    Image(
        python_version="python3.10",
        python_packages=[
            "torch",
            "transformers",
            "soundfile",
            "Pillow",
            "wheel",
            "packaging",
            "ninja",
            "huggingface_hub[hf-transfer]",
        ],
    )
    .add_commands(
        [
            "apt update && apt install git -y",
            "pip install git+https://github.com/huggingface/parler-tts.git",
        ]
    )
    .with_envs("HF_HUB_ENABLE_HF_TRANSFER=1")
)
```
Inference Function
The `generate_speech` function processes the input text and generates speech audio:
```python
@endpoint(
    name="parler-tts",
    on_start=load_models,
    cpu=2,
    memory="32Gi",
    gpu="A10G",
    gpu_count=2,
    image=parlertts_image,
)
def generate_speech(context, **inputs):
    # Retrieve the model and tokenizer loaded in on_start
    model, tokenizer = context.on_start_value

    prompt = inputs.pop("prompt", None)
    description = inputs.pop("description", None)
    if not prompt or not description:
        return {"error": "Please provide a prompt and description"}

    device = "cuda:0"
    # The description conditions the voice/style; the prompt is the text to speak
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()

    # Write the waveform to a temp file and upload it as a Beam Output
    file_name = f"/tmp/parler_tts_out_{uuid.uuid4()}.wav"
    sf.write(file_name, audio_arr, model.config.sampling_rate)

    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=1200000000)
    print(public_url)
    return {"output_url": public_url}
```
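The hand-off between `load_models` and the request handler happens through `context.on_start_value`: Beam runs `on_start` once when the container boots and passes its return value to every invocation. As a minimal plain-Python sketch of that pattern (illustrative only, not Beam's internals):

```python
# Illustrative stand-in for Beam's context object (not the real class)
class Context:
    def __init__(self, on_start_value):
        self.on_start_value = on_start_value


def load_models():
    # Stand-ins for the real model and tokenizer
    return "model", "tokenizer"


# Beam calls on_start once per container, then hands its return
# value to every request via context.on_start_value
ctx = Context(load_models())
model, tokenizer = ctx.on_start_value
```

Because the model is loaded once at startup rather than per request, only the first request after a cold start pays the model-loading cost.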
Deployment
Deploy the API to Beam:

```
beam deploy app.py:generate_speech
```
API Usage
Send a `POST` request with the following JSON payload:
```json
{
  "prompt": "Your text to convert to speech",
  "description": "Description of the voice/style"
}
```
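One way to send that request from Python is sketched below. The endpoint URL and auth token are placeholders: substitute the values printed by `beam deploy`.

```python
import json
import urllib.request

# Placeholder values -- use the URL and token printed by `beam deploy`
BEAM_URL = "https://app.beam.cloud/endpoint/parler-tts/v1"
BEAM_TOKEN = "YOUR_AUTH_TOKEN"

payload = {
    "prompt": "Your text to convert to speech",
    "description": "Description of the voice/style",
}

req = urllib.request.Request(
    BEAM_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {BEAM_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment once you have real credentials:
# with urllib.request.urlopen(req) as response:
#     print(json.loads(response.read())["output_url"])
```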
Example Request
```json
{
  "prompt": "On Beam run AI workloads anywhere with zero complexity. One line of Python, global GPUs, full control!!!",
  "description": "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
}
```
Example Response
The response contains a public URL to the generated audio file:

```json
{
  "output_url": "https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825"
}
```
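Since `output_url` is a plain HTTPS link, the audio can be fetched with any HTTP client. A minimal sketch (the URL argument is whatever your own deployment returned):

```python
import urllib.request


def download_audio(output_url: str, dest: str = "speech.wav") -> str:
    """Fetch the generated WAV from the returned public URL."""
    urllib.request.urlretrieve(output_url, dest)
    return dest


# Example, using the URL from your own response:
# download_audio("https://app.beam.cloud/output/id/...", "speech.wav")
```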
Summary
You’ve successfully deployed a Parler TTS text-to-speech API using Beam.