Example Llama 7B Deployment
In this example, we'll use llama-cpp-python to deploy a Llama2 7B model as a REST endpoint backed by a GPU.
This example only works on GPUs. Make sure you enable GPUs when deploying this model.
Setup
We recommend using Google Colab with high memory and a T4 GPU for this example.
First, download llama-2-7b.Q5_K_M.gguf from Hugging Face to your notebook environment. Store this file in the same directory as your notebook.
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q5_K_M.gguf
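Optionally, you can sanity-check that the download completed correctly: GGUF model files begin with the 4-byte magic `GGUF`. A minimal sketch (the `looks_like_gguf` helper is ours for illustration, not part of llama-cpp-python):

```python
# GGUF model files start with the 4-byte magic b"GGUF".
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# e.g. looks_like_gguf("./llama-2-7b.Q5_K_M.gguf") should return True
```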
Then pip install llama_cpp_python with GPU support.
pip install --extra-index-url=https://abetlen.github.io/llama-cpp-python/whl/cu122/ llama_cpp_python
Creating the deployment
Our deployment loads the Llama LLM during initialization using modelbit.setup and calls it at inference time in run_prompt:
from llama_cpp.llama import Llama
import modelbit

with modelbit.setup("load_llm"):
    llm = Llama(model_path="./llama-2-7b.Q5_K_M.gguf", n_gpu_layers=100, n_ctx=2048)

def run_prompt(prompt: str, max_tokens: int):
    return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
Call run_prompt to see it work locally:
run_prompt("Larry is a fun llama.", 100)
Which will return something like the following:
And he's not wearing his pajamas.
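The `["choices"][0]["text"]` indexing in `run_prompt` pulls the generated text out of the completion dict that llama-cpp-python returns. A simplified sketch of that shape (the values here are an assumed example, not real model output):

```python
# Simplified shape of a llama-cpp-python completion result (example values assumed)
result = {
    "choices": [
        {"text": " And he's not wearing his pajamas."}
    ]
}

# run_prompt returns just the generated text:
text = result["choices"][0]["text"]
```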
With your function working locally, you're ready to deploy!
Deploying your Llama to Modelbit
First log in to Modelbit:
import modelbit
mb = modelbit.login()
Then deploy run_prompt. When Modelbit builds your environment, it will automatically configure the CMAKE_ARGS and related NVIDIA libraries necessary to run your Llama model on the GPU.
mb.deploy(run_prompt,
          extra_files=["llama-2-7b.Q5_K_M.gguf"],
          setup="load_llm",
          require_gpu=True)
Notice that we loaded llama-2-7b.Q5_K_M.gguf with a relative path in the load_llm setup block. We'll upload it to the same relative path in our deployment using extra_files.
After run_prompt has deployed, you can call it:
curl -s -XPOST "http://...modelbit.com/v1/run_prompt/latest" -d '{"data": ["Larry is a fun llama.", 100]}' | json_pp
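If you'd rather call the endpoint from Python, the request body is the same JSON document the curl command sends: a `data` list holding the positional arguments of `run_prompt`. A sketch of building that body (the endpoint URL stays elided, as above; `build_request_body` is a helper we define here for illustration):

```python
import json

def build_request_body(prompt: str, max_tokens: int) -> str:
    # The endpoint takes run_prompt's positional arguments as a "data" list.
    return json.dumps({"data": [prompt, max_tokens]})

body = build_request_body("Larry is a fun llama.", 100)
# POST this body to your run_prompt endpoint, e.g. requests.post(url, data=body)
```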
You have successfully deployed a Llama2 7B model as a REST endpoint!