Example sklearn deployment
In this example we'll train a sklearn.linear_model.LinearRegression to predict the number of points an NBA player will earn, based on the number of field goals (baskets) made. Baskets are worth between 1 and 3 points in the NBA.
First, import and log in to Modelbit:
import modelbit
mb = modelbit.login()
Deploying a simple model for one-at-a-time inference
Then download the sample nba games dataset from Modelbit. We'll predict points (PTS) based on field goals made (FGM). If a player didn't score any points they'll have a null PTS, so we'll drop those rows from training:
df = modelbit.get_dataset("nba games")
df.dropna(inplace=True, subset=["FGM", "PTS"])
Now we'll train our model:
from sklearn.linear_model import LinearRegression
points_model = LinearRegression()
points_model.fit(df[["FGM"]], df["PTS"])
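For intuition about what the fit computes: with a single feature, linear regression reduces to the closed-form least-squares slope and intercept. The sketch below is pure Python with made-up numbers (not the NBA dataset), and `fit_line` is a hypothetical helper, not part of sklearn:

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: every basket is worth exactly 2 points.
fgm = [1, 2, 3, 4]
pts = [2, 4, 6, 8]
slope, intercept = fit_line(fgm, pts)
print(slope, intercept)  # 2.0 0.0
```

The slope is the model's estimate of points earned per made basket, and the intercept is the predicted score for a player with no baskets.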
We'll put the trained model in a deployment function, with type checking and unit tests. Since we know that fgm can be null, we'll check for that as well:
def predict_points(fgm: int) -> float:
    """This deployment predicts a player's score based on baskets made."""
    if fgm is None or type(fgm) is not int:
        return None
    return points_model.predict([[fgm]])[0]
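The unit tests mentioned above can be written as doctests in the function's docstring. The sketch below is self-contained so that it runs without a trained model: `StubModel` is a hypothetical stand-in for `points_model` that always predicts 2 points per basket, and the expected values in the docstring follow from that stub, not from the real NBA fit.

```python
import doctest
from typing import List, Optional


class StubModel:
    """Hypothetical stand-in for the trained LinearRegression (2 points per basket)."""

    def predict(self, rows: List[List[int]]) -> List[float]:
        return [2.0 * row[0] for row in rows]


points_model = StubModel()


def predict_points(fgm: int) -> Optional[float]:
    """
    Predict a player's score based on baskets made.

    >>> predict_points(3)
    6.0
    >>> predict_points(None) is None
    True
    >>> predict_points("3") is None
    True
    """
    # Reject nulls and non-integers before they reach the model.
    if fgm is None or type(fgm) is not int:
        return None
    return points_model.predict([[fgm]])[0]


# Run the docstring examples; nothing is printed when they all pass.
doctest.run_docstring_examples(predict_points, globals())
```

Because the check uses `type(fgm) is not int` rather than `isinstance`, booleans such as `True` are rejected too, even though `bool` is a subclass of `int`.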
Finally, we'll deploy our model to Modelbit with a custom Python environment that has the same versions of sklearn and Python we used to train the model.
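One way to keep the deployment environment in step with training is to read the versions from the running interpreter instead of typing them by hand. The sketch below uses only the standard library. `pinned_environment` is a hypothetical helper, and the commented `mb.deploy` call assumes that `python_version` and `python_packages` are the relevant parameter names, which should be confirmed against the Modelbit documentation.

```python
import sys
from importlib import metadata
from typing import List, Tuple


def pinned_environment(packages: List[str]) -> Tuple[str, List[str]]:
    """Return the current "major.minor" Python version and "name==version" pins."""
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip packages that are not installed in this environment.
            continue
    return python_version, pins


python_version, python_packages = pinned_environment(["scikit-learn"])
print(python_version, python_packages)

# Assumed usage (parameter names are an assumption, not confirmed by this page):
# mb.deploy(predict_points,
#           python_version=python_version,
#           python_packages=python_packages)
```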
Modelbit will then run the unit tests and package predict_points and its dependencies (points_model) in a Python 3.9 environment with the matching version of sklearn.
The deployment can then be called from REST, Snowflake, and Redshift.
Deploying a larger model for batch DataFrame inference
Next, we'll take advantage of Modelbit's DataFrame mode to deploy a slightly larger model designed for batch inference.
Let's take the case of an inbound lead scorer for an enterprise sales team. For this use case, we want to score the leads in batch in our warehouse. We'll use a Scikit-Learn RandomForestClassifier in a pipeline with a OneHotEncoder.
Assuming our features and data about lead conversion are in our leads_data DataFrame, we can start by splitting it into training and testing data, and then training the pipeline:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
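# "converted" is a placeholder name for the label column in leads_data.
X_train, X_test, y_train, y_test = train_test_split(
    leads_data.drop(columns=["converted"]), leads_data["converted"])
lead_score_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('classifier', RandomForestClassifier()),
])
lead_score_pipeline.fit(X_train, y_train)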
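The `handle_unknown='ignore'` setting matters in production, where a lead may arrive with a category the encoder never saw during training. With that setting, sklearn's OneHotEncoder encodes an unseen category as all zeros instead of raising an error. The pure-Python sketch below illustrates that behavior for a single column; `one_hot` and the category values are made up for illustration, and this is not sklearn's implementation.

```python
from typing import List


def one_hot(value: str, known_categories: List[str]) -> List[int]:
    """One-hot encode `value`; an unseen category becomes all zeros."""
    return [1 if value == category else 0 for category in known_categories]


# Categories "learned" from hypothetical training data, in sorted order.
known = sorted({"webinar", "referral", "ad"})

print(one_hot("referral", known))    # [0, 1, 0]
print(one_hot("trade show", known))  # [0, 0, 0]
```

The all-zero row lets the downstream classifier still produce a score for the lead, at the cost of ignoring the unseen value.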
Now that we've trained a lead scoring pipeline, we can test it with our holdback data (X_test and y_test).
Once that's done, we're ready to deploy!
To deploy this in DataFrame mode suitable for batch processing, write a function that takes a single DataFrame parameter, and uses it to score the leads in that DataFrame:
import numpy as np
import pandas as pd

def score_leads_batch(lead_features: pd.DataFrame) -> np.ndarray:
    return lead_score_pipeline.predict(lead_features)
Finally, we can deploy this model, giving Modelbit an example DataFrame so it knows how to transform the data on the way in:
mb.deploy(score_leads_batch, dataframe_mode=True, example_dataframe=X_train)
In the next example we'll show how data cleaning and feature engineering used during training can also be used by a deployment in production to convert input JSON data to a model-friendly dataframe.