You can scale out your app to multiple containers by adding autoscaling.

Scaling out horizontally (adding more containers)

When you deploy a Task Queue or REST API, Beam creates a queueing system that manages the tasks created each time your API is called.

You can configure how Beam scales based on the number of tasks in the queue.

Scale by Queue Depth

Our simplest autoscaling strategy allows you to scale by the number of tasks in the queue.

This allows you to control how many tasks each container can process before scaling up. For example, you could set up an autoscaler to run 30 tasks per container. When your queue passes 30 tasks, we will add a container. When it passes 60, we’ll add another container (up until max_containers is reached).

from beam import QueueDepthAutoscaler, endpoint

autoscaling_config = QueueDepthAutoscaler(
    max_containers=5,        # Never scale beyond 5 containers
    tasks_per_container=30,  # Add a container for every 30 queued tasks
)

@endpoint(autoscaler=autoscaling_config)
def function():
    ...
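
Conceptually, the autoscaler’s target container count follows this arithmetic (an illustrative sketch of the behavior described above, not Beam’s actual scheduler code):

import math

def desired_containers(queue_depth: int, tasks_per_container: int, max_containers: int) -> int:
    # One container for every `tasks_per_container` queued tasks, capped at `max_containers`
    if queue_depth == 0:
        return 0
    return min(math.ceil(queue_depth / tasks_per_container), max_containers)

# With tasks_per_container=30 and max_containers=5:
# desired_containers(30, 30, 5)   -> 1
# desired_containers(31, 30, 5)   -> 2
# desired_containers(61, 30, 5)   -> 3
# desired_containers(1000, 30, 5) -> 5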

Increasing throughput in a single container

You can increase throughput for your workloads by configuring the number of workers to launch per container. For example, if you have 4 workers on 1 container, you can run 4 tasks at once.

Workers are especially useful for CPU workloads, since you can increase throughput by adding workers and additional CPU cores, rather than autoscaling to additional containers.

from beam import Image, QueueDepthAutoscaler, task_queue


@task_queue(
    cpu=4,     # 1 CPU core per worker is a good rule of thumb
    workers=4, # Launch 4 workers per container to increase throughput
    image=Image(python_version="python3.8", python_packages=["pandas", "csaps"]),
    autoscaler=QueueDepthAutoscaler(max_containers=5, tasks_per_container=1),
)
def handler():
    import pandas as pd
    import time

    print(pd)
    time.sleep(5)
    return {"result": True}

Workers are always orchestrated together. When the container launches, all the workers launch.

This can result in higher throughput than using multiple containers with horizontal autoscaling.

Worker Use-Cases

Workers allow you to increase your per-container throughput, vertically. Autoscaling allows you to scale the number of containers and increase throughput horizontally.

Each worker will share the CPU, Memory, and GPU defined in your app. This means that you’ll usually need to increase these values in order to utilize more workers.

For example, let’s say your model uses 3Gi of GPU memory, and your app is deployed on a T4 GPU with 16Gi of GPU memory. You can safely deploy with 4 workers, since 4 × 3Gi = 12Gi fits within the 16Gi of GPU memory available.
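
As a sketch, a deployment sized this way might look like the following (the gpu value and the worker math in the comments are illustrative assumptions based on the example above):

from beam import QueueDepthAutoscaler, task_queue


@task_queue(
    gpu="T4",  # A T4 has 16Gi of GPU memory
    workers=4, # 4 workers x ~3Gi of model memory = ~12Gi, within the 16Gi available
    autoscaler=QueueDepthAutoscaler(max_containers=5, tasks_per_container=1),
)
def predict():
    ...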

It’s not always possible to fit multiple workers onto a single machine. In order to use workers effectively, you’ll need to know how much compute is consumed by your task.
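
One way to estimate GPU memory usage is to measure it after your model loads (a sketch assuming PyTorch; other frameworks expose similar counters):

import torch


def report_gpu_memory(label: str) -> None:
    # Print how much GPU memory is currently allocated and reserved,
    # so you can size `workers` against the GPU's total memory
    allocated = torch.cuda.memory_allocated() / (1024 ** 3)
    reserved = torch.cuda.memory_reserved() / (1024 ** 3)
    print(f"{label}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")


# Call this after loading your model to see its footprint, e.g.:
# report_gpu_memory("after model load")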

When you’ve added multiple workers, your logs will show a message each time a new worker starts.