Example sklearn deployment

In this example we'll train a sklearn.linear_model.LinearRegression to predict the number of points an NBA player will earn, based on the number of field goals (baskets) made. Baskets are worth between 1 and 3 points in the NBA.

First, import and log in to Modelbit:

import modelbit
mb = modelbit.login()

Deploying a simple model for one-at-a-time inference

Next, download the sample "nba games" dataset from Modelbit. We'll predict points (PTS) from field goals made (FGM). A player who didn't score any points has a null PTS, so we'll drop rows with a null FGM or PTS before training:

df = modelbit.get_dataset("nba games")
df.dropna(inplace=True, subset=["FGM", "PTS"])
df
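
To see what the subset argument does, here is a minimal sketch on a hypothetical four-row table. The player labels and values are invented for illustration and do not come from the Modelbit dataset. Only rows where both FGM and PTS are present survive:

```python
import pandas as pd

# Hypothetical stand-in for the sample dataset; all values are made up.
sample = pd.DataFrame({
    "PLAYER": ["a", "b", "c", "d"],
    "FGM": [3, None, 5, 0],
    "PTS": [7, 4, None, None],
})

# Drop a row if either FGM or PTS is null; nulls in other columns are ignored.
cleaned = sample.dropna(subset=["FGM", "PTS"])
print(cleaned["PLAYER"].tolist())  # ['a']
```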

Now we'll train our model:

from sklearn.linear_model import LinearRegression
points_model = LinearRegression()
points_model.fit(df[["FGM"]].values, df["PTS"].values)
points_model
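
A quick sanity check after fitting is to look at the learned slope, which is the predicted number of points per field goal and should plausibly fall between 1 and 3. The sketch below uses made-up data in which points are exactly 2 per field goal plus 1, so the expected slope and intercept are known in advance. The names toy_model, fgm, and pts are illustrative and not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data following PTS = 2 * FGM + 1 exactly.
fgm = np.array([[0], [1], [2], [3], [4]])
pts = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression().fit(fgm, pts)

# coef_ holds one slope per feature; intercept_ is the prediction at FGM = 0.
slope = float(toy_model.coef_[0])
intercept = float(toy_model.intercept_)
print(f"points per field goal: {slope:.2f}, baseline points: {intercept:.2f}")
```

On the real dataset the slope will not be a round number, but a value far outside the 1 to 3 range would suggest a problem with the training data.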

Now we'll put the trained model in a deployment function. Since we know that fgm can be null, we'll check for that as well:

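def predict_points(fgm: int) -> float:
    # Return None for missing or non-integer input instead of raising
    if fgm is None or type(fgm) is not int:
        return None
    # predict expects a 2D array: one row with one feature
    return float(points_model.predict([[fgm]])[0])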
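
Before deploying, it can help to call the function locally with both valid and invalid inputs. The sketch below is self-contained, so it substitutes a toy points_model fitted on invented data (PTS = 2 * FGM + 1) for the model trained above. Note that the strict type check also rejects floats and booleans.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the trained model, fitted on invented data: PTS = 2 * FGM + 1.
points_model = LinearRegression().fit(
    np.array([[0], [1], [2], [3]]), np.array([1.0, 3.0, 5.0, 7.0])
)

def predict_points(fgm: int) -> float:
    if fgm is None or type(fgm) is not int:
        return None
    return float(points_model.predict([[fgm]])[0])

print(predict_points(4))     # close to 9.0 for this toy model
print(predict_points(None))  # None
print(predict_points(4.0))   # None: a float fails the strict int check
print(predict_points(True))  # None: bool is not exactly int
```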

Finally, we'll deploy our model to Modelbit, which runs it in a Python environment with the same versions of scikit-learn and Python we used to train the model:

mb.deploy(predict_points)

Modelbit will package predict_points and its dependencies (including points_model) in a Python 3.9 environment with scikit-learn==1.1.1 installed. The deployment can then be called from REST, Snowflake, and Redshift.

Deploying a larger model for batch DataFrame inference

Next, we'll take advantage of Modelbit's DataFrame mode to deploy a slightly larger model designed for batch inference.

Let's take the case of an inbound lead scorer for an enterprise sales team. For this use case, we want to score leads in batch in our warehouse. We'll use a scikit-learn RandomForestClassifier in a pipeline with a OneHotEncoder.

Assuming our features and lead conversion data are in the leads_with_scores DataFrame, we can start by splitting it into training and testing data, then training the pipeline:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    leads_with_scores.drop(columns=['did_lead_convert']),
    leads_with_scores['did_lead_convert'].astype(int)
)

# One-hot encode the features, ignoring categories not seen during training
lead_score_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('classifier', RandomForestClassifier())
])

lead_score_pipeline.fit(X_train, y_train)

Now that we've trained a lead scoring pipeline, we can test it with our holdback X_test and y_test data. Once that's done, we're ready to deploy!
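
As one possible way to run that holdback test, the sketch below computes accuracy on the held-out rows. It is self-contained, so it builds a small invented leads_with_scores table in which only "tech" leads convert; the column names industry and source are assumptions for illustration. The random_state arguments are added only to make the toy run repeatable.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented leads: in this toy data only "tech" leads convert.
rows = [(i, s) for i in ["tech", "retail", "health"] for s in ["web", "referral"]] * 20
leads_with_scores = pd.DataFrame(rows, columns=["industry", "source"])
leads_with_scores["did_lead_convert"] = leads_with_scores["industry"] == "tech"

X_train, X_test, y_train, y_test = train_test_split(
    leads_with_scores.drop(columns=["did_lead_convert"]),
    leads_with_scores["did_lead_convert"].astype(int),
    random_state=0,
)

lead_score_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", RandomForestClassifier(random_state=0)),
])
lead_score_pipeline.fit(X_train, y_train)

# Compare predictions on the held-out rows with the true labels.
test_predictions = lead_score_pipeline.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"holdout accuracy: {test_accuracy:.2f}")
```

Accuracy is only one choice of metric. For imbalanced conversion data, precision and recall from sklearn.metrics may be more informative.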

To deploy this in DataFrame mode suitable for batch processing, write a function that takes a single DataFrame parameter, and uses it to score the leads in that DataFrame:

def score_leads_batch(lead_features: pd.DataFrame) -> np.ndarray:
    return lead_score_pipeline.predict(lead_features)
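
The function can be tried locally on a small batch before deploying. The sketch below is self-contained and uses invented training data with assumed column names (industry, source) in place of the real leads. It also includes a category the encoder never saw, which handle_unknown='ignore' encodes as all zeros rather than raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented training data standing in for the real leads.
train = pd.DataFrame({
    "industry": ["tech", "retail", "tech", "health"] * 10,
    "source": ["web", "web", "referral", "referral"] * 10,
})
converted = (train["industry"] == "tech").astype(int)

lead_score_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", RandomForestClassifier(random_state=0)),
]).fit(train, converted)

def score_leads_batch(lead_features: pd.DataFrame) -> np.ndarray:
    return lead_score_pipeline.predict(lead_features)

batch = pd.DataFrame({
    "industry": ["tech", "retail", "energy"],  # "energy" was never seen in training
    "source": ["web", "referral", "web"],
})
scores = score_leads_batch(batch)
print(scores.shape)  # (3,)
```

The result contains one prediction per input row, in the same order as the rows of the batch.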

Finally, we can deploy this model, giving Modelbit an example DataFrame so it knows how to transform the data on the way in:

mb.deploy(score_leads_batch, dataframe_mode=True, example_dataframe=X_train)

Next steps

In the next example we'll show how the data cleaning and feature engineering used during training can also be used by a deployment in production to convert input JSON data into a model-friendly DataFrame.