Optimizing deployment performance

Improving your deployment's performance is about decreasing latency and increasing throughput. There are several important areas to monitor and optimize when building high performance ML deployments:

  • Network latency: The time it takes to serialize and send your inference request to Modelbit, plus the time to serialize and return the response.
  • Runtime performance: The time it takes to process the inference through your ML model and related business logic.
  • Startup time: The time it takes to load the environment and deployment files needed to run your inference.

Below we'll discuss how to improve the performance of each of these areas.

Network latency

There are several ways to reduce the time it takes to send inferences from your server to your Modelbit deployment.

Batch requests

If you're sending hundreds or thousands of individual requests within a short time span, consider batching them instead.

Each inference request costs a small amount of "overhead". Overhead includes the time it takes to establish a connection between your server and Modelbit, Modelbit's processing of the request's parameters, and queuing the request for execution. Reducing the number of requests can significantly improve the overall throughput of your inferences.

Instead of sending many individual requests all at once, group the requests into batches and use the batch inference API.
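
For example, instead of calling modelbit.get_inference 500 times, one call can carry all 500 inputs. Here's a sketch; the [id, input] row format and the all_features variable are illustrative, so check the batch inference API docs for the exact shape your deployment expects:

import modelbit

# Each row is [id, input] so results can be matched back to inputs
# (the exact batch shape is an assumption; see the batch inference API docs)
batch = [[i, features] for i, features in enumerate(all_features)]

result = modelbit.get_inference(deployment="example_deployment", data=batch, ...)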

Persistent sessions

info

Using the modelbit.get_inference API handles persistent sessions and retries automatically.

When using an HTTP request library like requests, the default behavior is to establish a new connection for each call to the remote server. Establishing each connection can cost hundreds of milliseconds to open the connection and negotiate SSL encryption.

Use persistent sessions to maintain your connection to Modelbit between requests. Here's how to use persistent sessions with requests:

# requests.post creates a new connection each time. Use a session instead:
request_session = requests.Session()
request_session.post("https://...")

The more requests you send, the more important session persistence will become.
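
If you're calling the REST API directly, create the session once and reuse it for every request. Here's a sketch that also adds retries using urllib3's Retry helper; the URL and payload below are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create the session once and reuse it for every inference request
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5)))

for payload in payloads:  # "payloads" is a placeholder for your own request data
    # Replace the URL with your deployment's REST API URL
    response = session.post("https://...", json={"data": payload})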

Serialize and compress

Some deployments expect or return large blobs of data. For example, your deployment might expect several megabytes of numpy arrays, or return a long list of embeddings. The larger and more complicated your deployment's input/output, the longer it'll take to process over the network and through Modelbit.

In these situations, serializing and compressing your inputs/outputs will dramatically speed up your inference performance. Here's a serialize method (and its inverse, deserialize) that'll convert most kinds of Python objects into an optimized format for sending to/from Modelbit:

import pickle, base64, zlib

def serialize(data) -> str:
    # Pickles the object, compresses it, then converts it to a base64 string
    return base64.b64encode(zlib.compress(pickle.dumps(data))).decode("utf8")

def deserialize(data64: str):
    # Reverses the steps in serialize
    return pickle.loads(zlib.decompress(base64.b64decode(data64)))

These methods can reduce tens of megabytes of JSON objects and numpy arrays to hundreds of kilobytes! That will lead to big time savings over the network.

Here's an example showing how to serialize the data getting sent to your deployment:

import random, modelbit

inference_input = [random.uniform(0, 1) for _ in range(500)]

modelbit.get_inference(deployment="example_deployment", data=serialize(inference_input), ...)

In your deployment, deserialize the compressed input before sending it to your model:

# main function
def example_deployment(data64: str):
    inference_input = deserialize(data64)
    return model.predict(inference_input)

For deployments that return large responses the process is similar. Serialize the deployment's return value and then deserialize the response once you receive it.
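
For example, here's a sketch of a deployment that serializes a large result and a caller that deserializes it. It reuses the serialize/deserialize helpers above and assumes the deployment's return value comes back under the response's "data" key:

# In the deployment: serialize the large result before returning it
def example_deployment(data64: str) -> str:
    embeddings = model.predict(deserialize(data64))  # e.g. a large numpy array
    return serialize(embeddings)

# On the caller's side: deserialize the response after receiving it
response = modelbit.get_inference(deployment="example_deployment", data=serialize(inference_input), ...)
embeddings = deserialize(response["data"])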

Runtime performance

Improving the execution speed of your deployment can take many forms, depending on how your model works.

DataFrame mode

For models that consume their input as Pandas DataFrames (like Scikit-Learn and XGBoost), use Modelbit's DataFrame mode and send your inferences in batches. These types of models tend to perform best when processing entire DataFrames instead of individual inferences.
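
Here's a sketch of a DataFrame-mode deployment deployed from a notebook. The dataframe_mode and example_dataframe arguments, along with my_model and training_df, are assumptions for illustration; check the DataFrame mode docs for the exact call:

import pandas as pd
import modelbit

mb = modelbit.login()

# In DataFrame mode the main function receives the whole batch as one DataFrame
def example_deployment(df: pd.DataFrame):
    return my_model.predict(df)  # one vectorized call instead of one call per row

# Deploy with DataFrame mode enabled, passing a sample DataFrame as the example input
# (argument names are assumptions; confirm against the DataFrame mode docs)
mb.deploy(example_deployment, dataframe_mode=True, example_dataframe=training_df.head())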

Load data once

When Modelbit starts your deployment, it loads your source.py once and then calls your main function once for each inference. That means any setup code in source.py outside the main function only runs once. Take advantage of this by loading models and data just once, outside the main function.

For example, this deployment loads a large model once and reuses it for each inference:

source.py
# code outside the main function only gets called once
my_model = loadExpensiveModel(checkpoint="model.pt")

# main function
def example_deployment(...):
    return my_model.predict(...)

If you're using Git, simply add your setup code outside of the main function. If you're deploying from a notebook, use modelbit.setup to define one-time setup code.
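
Here's a sketch of the notebook approach. It assumes modelbit.setup works as a named block that mb.deploy references by name; check the setup documentation for the exact signature:

import modelbit

mb = modelbit.login()

# One-time setup: runs once when the deployment starts, not on every inference
# (assumes modelbit.setup is a named context manager; confirm against the setup docs)
with modelbit.setup("load_model"):
    my_model = loadExpensiveModel(checkpoint="model.pt")

# main function
def example_deployment(features):
    return my_model.predict(features)

# Reference the setup block by name when deploying
mb.deploy(example_deployment, setup="load_model")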

Caching function calls

Most deployments depend on a collection of Python functions to perform their inferences. Sometimes these function calls can be quite expensive, slowing down your deployment. If you're frequently calling expensive functions with the same inputs and expecting the same outputs, you should use caching.

info

Calls to Modelbit functions like get_model and get_dataset are cached automatically. So you don't need to cache calls to them.

Use @cache and @lru_cache from functools to cache the results of your expensive function calls. For example:

from functools import cache, lru_cache

# for methods that don't have arguments, use @cache
@cache
def download_config():
    ...

# for methods that frequently get called with the same arguments, use @lru_cache with a bounded size
@lru_cache(100)
def calculate_similarity(...):
    ...

Keep in mind that caching consumes memory in your deployment for each unique call to a cached method. If you cache too many large values your deployment can run out of memory and crash.

Startup time

Modelbit loads your deployment's environment before it runs an inference. This loading time is called a "cold start". Modelbit has a variety of caching layers to minimize the time that cold starts can take. Here's how you can reduce the cold start time of your deployments:

Minimize environment size

The size of your deployment's environment has a large impact on startup performance. Specify the minimum necessary Python libraries in your requirements.txt to keep your environment small. Some Python libraries can add gigabytes to your environment when they're installed, so only include them when necessary.
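
For example, a minimal requirements.txt pins only the libraries your deployment actually imports (the packages and versions below are illustrative):

# requirements.txt: only what the deployment imports, with pinned versions
scikit-learn==1.3.2
pandas==2.1.4
numpy==1.26.2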

Pre-download checkpoints

Some models, especially transformers from HuggingFace, will download their checkpoints when the model is first loaded if the files aren't found locally. Downloading checkpoints from HuggingFace (or elsewhere) when starting your deployment is quite slow.

Instead, make sure to include your checkpoint files in your deployment. Checkpoints added to Modelbit load into deployments much faster than files downloaded from elsewhere, because Modelbit uses storage optimized for deployments and their dependencies.
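
For example, with a HuggingFace transformer you can save the checkpoint files next to your deployment ahead of time and load them from that local path. The model name and directory below are illustrative:

from transformers import AutoModel, AutoTokenizer

# One time, before deploying: save the checkpoint files alongside the deployment
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained("./checkpoint")
AutoModel.from_pretrained("bert-base-uncased").save_pretrained("./checkpoint")

# In the deployment: load from the local files included with the deployment,
# so nothing gets downloaded during a cold start
tokenizer = AutoTokenizer.from_pretrained("./checkpoint")
model = AutoModel.from_pretrained("./checkpoint")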

You can also use file-based models in the model registry to get the benefits of Modelbit's fast file storage without needing the files stored with the deployment.