Example Llama 7B Deployment w/GPU
In this example, we'll use llama-cpp-python to deploy a Llama 2 7B model as a REST endpoint.
This example only works on GPUs. Make sure you enable GPUs when deploying this model.
Setup
We recommend using Google Colab with high memory and a T4 GPU for this example.
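You can confirm that a GPU is attached to your notebook environment by running nvidia-smi:
nvidia-smi
If the output lists a T4 (or other) GPU, you're ready to proceed.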
First, download llama-2-7b.Q5_K_M.gguf from Hugging Face to your notebook environment. Store this file in the same directory as your notebook.
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q5_K_M.gguf
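The model file is roughly 5 GB for this quantization, so the download may take a few minutes. You can confirm it completed by checking the file size:
ls -lh llama-2-7b.Q5_K_M.gguf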
Then pip install llama_cpp_python with GPU support.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama_cpp_python
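As an optional sanity check, confirm that the package built and imports correctly:
python -c "import llama_cpp; print(llama_cpp.__version__)"
If this prints a version number, the install succeeded.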
Creating the deployment
We'll use two functions in our deployment. The first instantiates Llama and caches the result. The caching will speed up inferences when you call this deployment in Modelbit. The second function will simply pass a prompt to the Llama model and return the result.
from llama_cpp.llama import Llama
from functools import cache

@cache
def get_llm():
    # Load the model once; @cache reuses the same instance on later calls.
    # n_gpu_layers=100 offloads all layers to the GPU; n_ctx sets the context window.
    return Llama(model_path="./llama-2-7b.Q5_K_M.gguf", n_gpu_layers=100, n_ctx=2048)

def run_prompt(prompt: str, max_tokens: int):
    # Pass the prompt to the cached model and return just the generated text.
    llm = get_llm()
    return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
Call run_prompt to see it work locally:
run_prompt("Larry is a fun llama.", 100)
Which will return something like the following:
And he's not wearing his pajamas.
With your function working locally, you're ready to deploy!
Deploying your Llama to Modelbit
First log in to Modelbit:
import modelbit
mb = modelbit.login()
Then deploy run_prompt. When Modelbit builds your environment, it will automatically configure the CMAKE_ARGS and related NVIDIA libraries necessary to run your Llama model on the GPU.
mb.deploy(run_prompt,
          extra_files=["llama-2-7b.Q5_K_M.gguf"],
          require_gpu=True)
Notice that we loaded llama-2-7b.Q5_K_M.gguf with a relative path in get_llm. We'll upload it with the same relative path in our deployment using extra_files.
After run_prompt has deployed, you can call it:
curl -s -XPOST "http://...modelbit.com/v1/run_prompt/latest" -d '{"data": ["Larry is a fun llama.", 100]}' | json_pp
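Assuming the standard Modelbit response envelope, the pretty-printed output should look something like this (the generated text will vary from run to run):
{
   "data" : " And he's not wearing his pajamas."
}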
You have successfully deployed a Llama 2 7B model as a REST endpoint!