This guide demonstrates how to set up and run the Parler TTS text-to-speech model as a serverless API on Beam.

View the Code

See the code for this example on GitHub.

Introduction

Parler-TTS Mini is a lightweight text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural-sounding speech with features that can be controlled using a simple text prompt. This guide explains how to deploy and use it on Beam.

Deployment Setup

Define the model loading function and the container image, parlertts_image, that provides its dependencies:

from beam import endpoint, env, Image, Output

if env.is_remote():
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    import uuid

def load_models():
    # Runs once per container via the endpoint's on_start hook,
    # so the model weights are downloaded and moved to the GPU
    # before the first request arrives.
    model = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1").to("cuda:0")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
    return model, tokenizer

parlertts_image = (
    Image(
        python_version="python3.10",
        python_packages=[
            "torch",
            "transformers",
            "soundfile",
            "Pillow",
            "wheel",
            "packaging",
            "ninja",
            "huggingface_hub[hf-transfer]",
        ],
    )
    .add_commands(
        [
            "apt update && apt install git -y",
            "pip install git+https://github.com/huggingface/parler-tts.git",
        ]
    )
    .with_envs("HF_HUB_ENABLE_HF_TRANSFER=1")
)

Inference Function

The generate_speech function takes a text prompt and a voice description, generates the speech audio, and returns a public URL to the resulting WAV file:

@endpoint(
    name="parler-tts",
    on_start=load_models,
    cpu=2,
    memory="32Gi",
    gpu="A10G",
    image=parlertts_image
)
def generate_speech(context, **inputs):
    model, tokenizer = context.on_start_value

    prompt = inputs.pop("prompt", None)
    description = inputs.pop("description", None)

    if not prompt or not description:
        return {"error": "Please provide a prompt and description"}
    
    device = "cuda:0"

    input_ids = tokenizer(
        description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(
        prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(
        input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()

    file_name = f"/tmp/parler_tts_out_{uuid.uuid4()}.wav"

    sf.write(file_name, audio_arr, model.config.sampling_rate)
   
    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=1200000000)
    print(public_url)
    return {"output_url": public_url}

Deployment

Deploy the API to Beam:

beam deploy app.py:generate_speech

API Usage

Send a POST request with the following JSON payload:

{
    "prompt": "Your text to convert to speech",
    "description": "Description of the voice/style"
}
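Once deployed, the endpoint can be called from Python. This is a minimal sketch: the endpoint URL and auth token below are placeholders that you should replace with the values printed by beam deploy and your own Beam credentials.

```python
import json

# Placeholder values — substitute your deployment's URL and your Beam auth token
BEAM_URL = "https://app.beam.cloud/endpoint/parler-tts/v1"
BEAM_TOKEN = "YOUR_AUTH_TOKEN"

# The JSON payload expected by generate_speech
payload = {
    "prompt": "Your text to convert to speech",
    "description": "Description of the voice/style",
}

# Uncomment to send the request against a live deployment:
# import requests
# resp = requests.post(
#     BEAM_URL,
#     headers={
#         "Authorization": f"Bearer {BEAM_TOKEN}",
#         "Content-Type": "application/json",
#     },
#     data=json.dumps(payload),
# )
# print(resp.json()["output_url"])

print(json.dumps(payload, indent=2))
```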

Example Request

{
    "prompt": "On Beam run AI workloads anywhere with zero complexity. One line of Python, global GPUs, full control!!!",
    "description": "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
}

Example Response

The response contains a public URL to the generated audio file:

{
    "output_url": "https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825"
}
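The output_url is a signed link, so the WAV file can be fetched with any HTTP client. A small helper, sketched below with the standard library (the example URL is the one from the response above and will only work for your own deployment):

```python
import urllib.request

def download_audio(url: str, dest: str = "parler_tts_out.wav") -> str:
    """Fetch the generated WAV from the signed URL and save it locally."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as f:
        f.write(resp.read())
    return dest

# Example (requires a live URL from your own deployment):
# download_audio("https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825")
```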

Summary

You’ve successfully deployed a Parler TTS text-to-speech API using Beam.