In this example, we are going to use vLLM to host an OpenAI-compatible Yi-Coder API on Beam.

View the Code

See the code for this example on GitHub.

vLLM

vLLM is a high-performance, easy-to-use library for LLM inference. It can deliver up to 24x higher throughput than HuggingFace’s Transformers library, and it lets you easily set up an OpenAI-compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA, which means you can follow our LoRA guide and host the resulting model with vLLM.
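For reference, here is roughly what hosting a LoRA fine-tune with vLLM’s offline engine looks like. This is a minimal sketch rather than part of this example: the base model is only an illustration, and the adapter name and path are placeholders you would replace with your own.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a base model with LoRA support enabled
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", enable_lora=True)

# Attach a fine-tuned adapter at request time
# (the adapter name, ID, and path below are placeholders)
outputs = llm.generate(
    ["Write a function that reverses a string in Python."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("my-lora-adapter", 1, "/path/to/your/adapter"),
)
print(outputs[0].outputs[0].text)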

The key to vLLM’s performance is PagedAttention. In LLMs, input tokens produce attention key and value tensors, which are typically stored in GPU memory. PagedAttention stores these keys and values in non-contiguous memory by partitioning them into blocks that are fetched on an as-needed basis.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
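To build some intuition for the block-table analogy above, here is a toy sketch in plain Python. It is purely illustrative and is not vLLM’s actual implementation; the block size and values are made up.

# A sequence's KV cache is split into fixed-size logical blocks, and a
# per-sequence block table maps each logical block to whichever physical
# block happened to be free.

BLOCK_SIZE = 4  # tokens per block (arbitrary for this sketch)

# Physical KV-cache "blocks", allocated from a shared pool in no particular order
physical_blocks = {
    7: ["kv0", "kv1", "kv2", "kv3"],
    2: ["kv4", "kv5", "kv6", "kv7"],
    9: ["kv8"],
}

# Block table for one sequence: logical block index -> physical block ID
block_table = [7, 2, 9]

def lookup(token_index: int) -> str:
    # Like virtual memory: translate a logical position into a physical location
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    return physical_blocks[block_table[logical_block]][offset]

print(lookup(5))  # "kv5" -- stored in physical block 2 at offset 1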

Yi-Coder

Yi-Coder is an open-source LLM that delivers state-of-the-art performance on coding tasks with fewer than 10B parameters. It supports a wide variety of programming languages.

Offline Inference in a Beam Function

Later in this example, we will use vLLM to set up a chat API. Before we do that, let’s write a simple function to test the inference engine directly.

To get started, we need to import Beam and vLLM.

from beam import Image, Volume, env, function

# These imports are only available in the remote environment
if env.is_remote():
    from vllm import LLM

vLLM downloads models from HuggingFace and will use the default HF caching directory unless we provide an alternate path with the download_dir argument. We will use a Beam Volume to cache the model.

vllm_cache = Volume(name="yicoder", mount_path="./yicoder")

We will define the environment for our remote function using the @function decorator. For this example, 1 CPU core, 8Gi of memory, and an A100-40 GPU will be sufficient. In the function itself, we use the LLM class from vLLM to run offline inference, specifying the model we want to use, the path where the model should be downloaded/loaded, and the maximum context length.

@function(
    image=Image().add_python_packages(["vllm"]),
    volumes=[vllm_cache],
    gpu="A100-40",
    memory="8Gi",
    cpu=1,
)
def yicoder(prompt: str):
    llm = LLM(
        model="01-ai/Yi-Coder-9B-Chat",
        download_dir=vllm_cache.mount_path,
        max_model_len=8096,
    )
    request_output = llm.chat(
        messages=[{"role": "user", "content": prompt}],
    )
    return request_output[0].outputs[0].text

And that’s it! Our function can now be invoked:

yicoder.remote("How can I use `echo` to say hi in my terminal?")

The output will look something like this:

=> Function complete <75f60737-470c-4da4-aab0-5789636ac657> 
In the terminal, you can say "hi" by simply typing `echo "hi"`.

Hosting an OpenAI-Compatible Chat API with vLLM

Running inference with vLLM’s engine is great, but it would be much more useful to have a fully functional chat API that we can interact with using existing SDKs. You could build this functionality yourself, but vLLM also provides a great out-of-the-box solution.

To get started, we need to import Beam, FastAPI, and various parts of the vLLM library. While we are at it, we will also define a variable with the model name.

from beam import Image, Volume, asgi, env

# These imports are only available in the remote environment
if env.is_remote():
    import asyncio

    import fastapi
    import vllm.entrypoints.openai.api_server as api_server
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.entrypoints.logger import RequestLogger
    from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
    from vllm.entrypoints.openai.serving_completion import (
        OpenAIServingCompletion,
    )
    from vllm.usage.usage_lib import UsageContext

MODEL_NAME = "01-ai/Yi-Coder-9B-Chat"

Next, let’s set up the same volume from the previous section, which will already have our model cached.

vllm_cache = Volume(name="yicoder", mount_path="./yicoder")

To host the inference server, we will create a Beam asgi app. These apps allow you to run any ASGI-compatible web server. We will set up our environment with 1 CPU core, 8Gi of memory, and an A100-40.

@asgi(
    image=Image().add_python_packages(["vllm"]),
    volumes=[vllm_cache],
    gpu="A100-40",
    memory="8Gi",
    cpu=1,
    keep_warm_seconds=360,
)

If you are going to be using the model continuously, you may want to set keep_warm_seconds to a higher value so that the server doesn’t restart between invocations; for example, keep_warm_seconds=600 keeps the container alive through ten minutes of inactivity. By default, the server shuts down after 2 minutes of inactivity.

With our environment configured, we can now create our server. We will start by creating a FastAPI app and adding a health check endpoint. This health check is required as it will be used by vLLM to ensure that the server is ready. We will also add the vLLM router to the app.

def yicoder_server():
    app = fastapi.FastAPI(
        title=f"{MODEL_NAME} server",
        docs_url="/docs",
    )

    # Health check is required as it will be checked during setup for vllm
    @app.get("/health")
    async def health_check():
        return {"status": "healthy"}

    app.include_router(api_server.router)

Next, we can create the vLLM engine client, which will be used to fulfill our inference requests.

    engine_args = AsyncEngineArgs(
        model=MODEL_NAME,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.90,
        max_model_len=8096,
        enforce_eager=False,
        download_dir=vllm_cache.mount_path,
    )

    async_engine_client = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER
    )

Finally, we can set up the API server and return our app. We also create a request logger. This is optional, but it will log each incoming request along with its prompt and parameters.

    model_config = asyncio.run(async_engine_client.get_model_config())

    request_logger = RequestLogger(max_log_len=2048)

    api_server.openai_serving_chat = OpenAIServingChat(
        async_engine_client,
        model_config=model_config,
        served_model_names=[MODEL_NAME],
        chat_template=None,
        response_role="assistant",
        lora_modules=[],
        prompt_adapters=[],
        request_logger=request_logger,
    )

    api_server.openai_serving_completion = OpenAIServingCompletion(
        async_engine_client,
        model_config=model_config,
        served_model_names=[MODEL_NAME],
        lora_modules=[],
        prompt_adapters=[],
        request_logger=request_logger,
    )

    return app

To deploy our app, we can use the beam deploy command.

beam deploy yi.py:yicoder_server --name yicoder-server
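Before building a full chat client, you can optionally confirm the deployment is reachable by hitting the /health endpoint we defined earlier. Below is a minimal sketch using the requests package; it assumes the deployment URL format shown in the client further down and a valid Beam API token, and the first request may take a moment while the model loads.

import requests

BEAM_API_KEY = "YOUR_BEAM_API_KEY"
health_url = "https://app.beam.cloud/asgi/yicoder-server/v1/health"

# The /health route we added to the FastAPI app should return {"status": "healthy"}
resp = requests.get(health_url, headers={"Authorization": f"Bearer {BEAM_API_KEY}"})
print(resp.status_code, resp.json())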

Now that we’ve deployed our Yi-Coder server, let’s build a simple CLI application to interact with it. It will track our conversation history and allow us to chat with the model. Since the server exposes an OpenAI-compatible API, we can use the openai Python package to talk to it.

from openai import OpenAI


def chat_with_gpt():
    openai_completions_api_version = "v1"
    beam_deployment_url = "https://app.beam.cloud/asgi/yicoder-server/v1"

    client = OpenAI(
        api_key="YOUR_BEAM_API_KEY",
        base_url=f"{beam_deployment_url}/{openai_completions_api_version}",
    )

    conversation_history = []

    print("Welcome to the CLI Chat Application!")
    print("Type 'quit' to exit the conversation.")

    if client.models.list().data[0].id == "01-ai/Yi-Coder-9B-Chat":
        print("Model is ready")
    else:
        print("Failed to load model")
        exit(1)

    try:
        while True:
            user_input = input("You: ")

            if user_input.lower() == "quit":
                print("Goodbye!")
                break

            conversation_history.append({"role": "user", "content": user_input})
            response = client.chat.completions.create(
                model="01-ai/Yi-Coder-9B-Chat", messages=conversation_history
            )
            assistant_reply = response.choices[0].message.content
            conversation_history.append(
                {"role": "assistant", "content": assistant_reply}
            )

            print("Assistant:", assistant_reply)

    except KeyboardInterrupt:
        print("\nExiting the chat.")


if __name__ == "__main__":
    chat_with_gpt()

If you run this script, you will be able to chat with your Yi-Coder model. Here is an example conversation I had with it:

Welcome to the CLI Chat Application!
Type 'quit' to exit the conversation.
Model is ready
You: hi
Assistant: Hello! How can I assist you today?
You: How can I format a string in Go?
Assistant: In Go, you can use the `fmt` package to format strings. Here's a simple example:

```
package main

import (
        "fmt"
)

func main() {
        name := "John"
        age := 25

        // Formatting the string
        message := fmt.Sprintf("My name is %s and I am %d years old.", name, age)

        fmt.Println(message)
}
```

In the `Sprintf` function, `%s` is a placeholder for a string and `%d` for an integer. `Sprintf` stands for 'string format print', it formats the input following the rules within the string and returns the result as a string.

You can use other placeholders as well, like `%f` for floats, `%t` for booleans, etc. You can find the full list of placeholders in Go's fmt package documentation.

Note that this code must be in a Go environment to run. If you run this outside of a Go environment, it might not work as expected.
You: Awesome!
Assistant: You're welcome! If you have other questions, feel free to ask.