Example Llama 7B Deployment

In this example, we'll use llama-cpp-python to deploy a Llama2 7B model as a REST endpoint backed by a GPU.

Info: This example only works on GPUs. Make sure you enable GPUs when deploying this model.

Setup

We recommend using Google Colab with high memory and a T4 GPU for this example.

First, download llama-2-7b.Q5_K_M.gguf from Hugging Face to your notebook environment. Store this file in the same directory as your notebook.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q5_K_M.gguf

Then pip install llama_cpp_python with GPU support.

pip install --extra-index-url=https://abetlen.github.io/llama-cpp-python/whl/cu122/ llama_cpp_python
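Before moving on, it can be worth confirming the install worked. A quick sanity check (assuming the package exposes its version as llama_cpp.__version__, which recent releases do):

import llama_cpp

# Confirm the GPU-enabled wheel installed and imports cleanly
print(llama_cpp.__version__)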

Creating the deployment

Our deployment loads the Llama LLM during initialization using modelbit.setup and calls it at inference time in run_prompt:

from llama_cpp.llama import Llama
import modelbit

# Load the model once, during deployment initialization
with modelbit.setup("load_llm"):
    llm = Llama(model_path="./llama-2-7b.Q5_K_M.gguf", n_gpu_layers=100, n_ctx=2048)

# Called for each inference request
def run_prompt(prompt: str, max_tokens: int):
    return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
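run_prompt returns just the generated text. The underlying llm(...) call returns an OpenAI-style completion dictionary, which is why we index into ["choices"][0]["text"]. As a rough sketch (the field values here are illustrative, and exact fields can vary by llama-cpp-python version):

{
    "id": "cmpl-...",
    "object": "text_completion",
    "created": 1700000000,
    "model": "./llama-2-7b.Q5_K_M.gguf",
    "choices": [
        {
            "text": " And he's not wearing his pajamas.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 8, "completion_tokens": 100, "total_tokens": 108},
}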

Call run_prompt to see it work locally:

run_prompt("Larry is a fun llama.", 100)

Which will return something like the following:

And he’s not wearing his pajamas.

With your function working locally, you're ready to deploy!

Deploying your Llama to Modelbit

First log in to Modelbit:

import modelbit
mb = modelbit.login()

Then deploy run_prompt. When Modelbit builds your environment, it will automatically configure the CMAKE_ARGS and related NVIDIA libraries necessary to run your Llama model on the GPU.

mb.deploy(run_prompt,
          extra_files=["llama-2-7b.Q5_K_M.gguf"],
          setup="load_llm",
          require_gpu=True)

Notice that we loaded llama-2-7b.Q5_K_M.gguf with a relative path in the load_llm setup block. We'll upload it to the same relative path in our deployment using extra_files.

Once run_prompt is deployed, you can call it:

curl -s -XPOST "http://...modelbit.com/v1/run_prompt/latest" -d '{"data": ["Larry is a fun llama.", 100]}' | json_pp
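You can also call the endpoint from Python. Here's a minimal sketch using the requests library, assuming the same endpoint URL and the standard Modelbit request/response shape (a "data" list in the request, a "data" field in the response):

import requests

# Replace with your deployment's endpoint URL from Modelbit
url = "http://...modelbit.com/v1/run_prompt/latest"

# Arguments are passed positionally: [prompt, max_tokens]
result = requests.post(url, json={"data": ["Larry is a fun llama.", 100]})
print(result.json()["data"])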

You have successfully deployed a Llama2 7B model as a REST endpoint!