Batch DataFrame deployments

Many ML libraries are designed to operate in batch on Pandas DataFrames. They take their input features as a DataFrame parameter and return an iterable of inferences, such as a DataFrame or NumPy array.

When calling these models in offline batch scenarios, such as in a data warehouse or dbt model, Modelbit can send your model an entire batch as a DataFrame, and accept any iterable as a return value.

Writing your deploy function for DataFrame mode

For DataFrame mode, your deploy function should accept exactly one parameter, a Pandas DataFrame. After using that DataFrame for inferences, it should return any Python iterable with the same length as the input DataFrame.

For example, with a simple Scikit-Learn regression:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = pd.DataFrame({
    "feature_one": [1, 2, 3],
    "feature_two": [2, 4, 6]
})

y_train = pd.DataFrame({
    "result": [3, 6, 9]
})

regression = LinearRegression().fit(X_train, y_train)

def get_predictions(features_df: pd.DataFrame) -> np.ndarray:
    return regression.predict(features_df)
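Before deploying, it can help to sanity-check this contract locally: the function must return exactly one inference per input row. A minimal sketch of that check, using a plain arithmetic stand-in for the fitted regression so it runs without Scikit-Learn:

```python
import pandas as pd

def get_predictions(features_df: pd.DataFrame) -> list:
    # Stand-in for regression.predict: in the training data above,
    # result = feature_one + feature_two
    return (features_df["feature_one"] + features_df["feature_two"]).tolist()

batch = pd.DataFrame({
    "feature_one": [1, 2, 3],
    "feature_two": [2, 4, 6]
})

predictions = get_predictions(batch)
assert len(predictions) == len(batch)  # one inference per input row
```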

Deploying your model in DataFrame mode

To tell Modelbit to deploy in batch DataFrame mode, supply two extra parameters to mb.deploy:

  • dataframe_mode (Boolean): Set dataframe_mode to True
  • example_dataframe (DataFrame): Give Modelbit an example_dataframe with the same column names and types as the DataFrame your function expects. Modelbit uses this example to generate sample SQL code and transform inputs from SQL objects to DataFrames at runtime in production.

To deploy our example Scikit-Learn regression above, write:

mb.deploy(get_predictions, dataframe_mode = True, example_dataframe = X_train)

In this case, we reuse the training DataFrame X_train as the example_dataframe in the mb.deploy call because it is shaped exactly the way the deploy function expects its inputs. Reusing the training DataFrame this way is good practice whenever it matches the expected input exactly.

If you wish to avoid sending your actual training data to Modelbit, you can strip the rows out of the DataFrame while keeping its column names and data types by calling .head(0) on it. Your mb.deploy call would then look like this:

mb.deploy(get_predictions, dataframe_mode = True, example_dataframe = X_train.head(0))
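You can confirm for yourself that .head(0) keeps the schema while dropping the rows. This check is purely illustrative and not required for deployment:

```python
import pandas as pd

X_train = pd.DataFrame({
    "feature_one": [1, 2, 3],
    "feature_two": [2, 4, 6]
})

empty = X_train.head(0)
print(len(empty))                            # 0 -- no training rows are sent
print(list(empty.columns))                   # ['feature_one', 'feature_two']
print(empty.dtypes.equals(X_train.dtypes))   # True -- dtypes are preserved
```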

Calling your DataFrame-mode model from Python

You can use modelbit.get_inference to call your DataFrame-mode model. Supply the DataFrame as the data= parameter:

import modelbit
modelbit.get_inference(workspace="your-workspace", deployment="get_predictions", data=X_train)

This is equivalent to sending data as a list of [index, record] pairs formatted like this:

[
  [0, { "feature_one": 1, "feature_two": 2 }],
  [1, { "feature_one": 2, "feature_two": 4 }],
  [2, { "feature_one": 3, "feature_two": 6 }]
]
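If you are building this payload yourself, for example to call the REST endpoint directly, one way to produce the same shape from a DataFrame is sketched below. This is illustrative only, not Modelbit's internal implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "feature_one": [1, 2, 3],
    "feature_two": [2, 4, 6]
})

# One [index, record] pair per row, matching the format shown above
payload = [[i, rec] for i, rec in enumerate(df.to_dict(orient="records"))]
```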

Calling your DataFrame-mode model from SQL

To call your DataFrame-mode model from SQL, supply a single object shaped like the DataFrame your Python function receives as a parameter, with keys matching the DataFrame's column names.

For example, in Snowflake:

select my_schema.get_predictions_latest({
    'feature_one': feature_one_col,
    'feature_two': feature_two_col
})
from my_table;

In Redshift, making use of the object function:

select my_schema.get_predictions_latest(json_serialize(object(
    'feature_one', feature_one_col,
    'feature_two', feature_two_col
)))
from my_table;

Modelbit will generate example SQL for you in the "API Endpoints" screen of your model. Your warehouse and Modelbit will handle the batching of calls automatically.