Example XGBoost deployment

In this example we will train an xgboost.XGBClassifier to predict the position of a basketball player based on the number of points they scored and how long they were on the court. Predicted positions can be NONE, C (Center), F (Forward), or G (Guard).

We'll also structure our notebook so that the feature engineering and encoders we use for training can be shared with the deployment in production.

First, in your notebook, import and log in to Modelbit:

import modelbit
mb = modelbit.login()

In Modelbit, create two new datasets based on NBA games:

The first: nba_game_stats:

select TEAM_ABBR, START_POSITION, MINUTES_PLAYED, PTS
from NBA_GAME_DETAILS

And the second: nba_average_game_points:

select TEAM_ABBR, avg(PTS) as avg_game_points
from NBA_GAME_DETAILS group by 1

Then we'll download both to our notebook and join them for our training data:

import pandas as pd

df_details = mb.get_dataset("nba_game_stats")
df_summary = mb.get_dataset("nba_average_game_points")

df = pd.merge(df_details, df_summary, how="left", on=["TEAM_ABBR"])
del df_details, df_summary
df = df.head(5_000) # limit dataset for quick perf on laptops
df

Now that we have our data ready, let's start with cleaning and feature engineering:

Data cleaning & feature engineering

We'll scrub the data to remove nulls, fix formatting, etc. We're using a function so we can reuse this cleaning logic in the deployment:

def clean_data(input_df):
    if "START_POSITION" in input_df:  # production calls to the deployment won't have the field we're predicting
        input_df["START_POSITION"] = input_df["START_POSITION"].fillna("NONE")
    input_df["PTS"] = input_df["PTS"].fillna(0)
    input_df["MINUTES_PLAYED"] = input_df["MINUTES_PLAYED"].fillna("0").apply(lambda x: int(str(x).split(":")[0]))

clean_data(df)
df
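
As a quick sanity check of the cleaning logic, we can run it on a hypothetical row (the values below are made up, and the AVG_GAME_POINTS column name assumes what the merged dataset above returns):

sample = pd.DataFrame({"TEAM_ABBR": ["LAL"], "START_POSITION": [None],
                       "MINUTES_PLAYED": ["38:24"], "PTS": [None], "AVG_GAME_POINTS": [10.2]})
clean_data(sample)
sample  # START_POSITION -> "NONE", PTS -> 0, MINUTES_PLAYED -> 38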

We'll collect the summary stats and one-hot encodings needed for feature transformations. The deployment won't have access to the training data, so it can't learn the full set of team names for one-hot encoding on its own. Instead, we fit a OneHotEncoder that we can reuse in the deployment. By storing this helper data in variables, rather than applying it directly to the dataframe, we'll be able to use it again later.

from sklearn.preprocessing import OneHotEncoder

team_encoder = OneHotEncoder(handle_unknown='ignore')
team_encoder.fit(df[["TEAM_ABBR"]])
team_encoder.categories_
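
Because we passed handle_unknown='ignore', a team abbreviation the encoder never saw during fitting encodes to an all-zero row instead of raising an error. A small illustration ("XYZ" is a made-up abbreviation):

team_encoder.transform(pd.DataFrame({"TEAM_ABBR": ["XYZ"]})).toarray()  # all zeros for an unseen team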

Now we'll apply feature engineering to the data frame. Again, we'll use a function so that we can share this logic with the deployment in production.

import numpy as np

def apply_features(input_df):
    # Set the TEAM_ABBR column as categorical so we can onehot the missing values to 0
    input_df["TEAM_ABBR"] = input_df["TEAM_ABBR"].astype(
        pd.api.types.CategoricalDtype(team_encoder.categories_[0]))
    output_df = pd.concat([  # Concat onehot'd columns as bools with the rest of the features
        input_df.drop(["TEAM_ABBR"], axis=1),
        pd.get_dummies(input_df["TEAM_ABBR"], prefix="TEAM_ABBR").astype(bool)
    ], axis=1)

    # We can make more features here as well
    output_df["PLAYED_AT_ALL"] = (output_df["MINUTES_PLAYED"] > 0)
    output_df["POINTS_PER_MINUTE"] = (
        output_df["PTS"] / output_df["MINUTES_PLAYED"]
    ).replace([np.inf, -np.inf], np.nan).fillna(0)
    return output_df

df_model = apply_features(df)

display(df_model)
df_model.dtypes

Our data is ready for training.

Training the XGBoost model

First, we'll create the train/test split:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

start_position_encoder = LabelEncoder()

X = df_model.drop(["START_POSITION"], axis=1)
y = start_position_encoder.fit_transform(df_model["START_POSITION"])

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7, random_state=42)

y[:100], start_position_encoder.classes_
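
The LabelEncoder also gives us the inverse mapping we'll rely on in the deployment to turn numeric predictions back into position labels:

start_position_encoder.inverse_transform(y[:5])  # decode a few encoded labels back to position names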

Then we'll search the parameter space with RandomizedSearchCV:

import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

params = {'max_depth': [5, 10],
          'learning_rate': [0.1, 0.3],
          'subsample': np.arange(0.7, 1.0, 0.1),
          'colsample_bytree': np.arange(0.7, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.7, 1.0, 0.1),
          'n_estimators': [3, 5, 10],
          'max_delta_step': [0, 0.3, 0.7],
          'min_child_weight': [1, 3, 5],
          'gamma': [0, 0.1, 0.3],
          'reg_lambda': [0.5, 3.0, 5.0],
          'reg_alpha': [0, 0.1, 2]}

xgbr = xgb.XGBClassifier(seed = 42)

clf = RandomizedSearchCV(estimator=xgbr,
                         param_distributions=params,
                         scoring='neg_log_loss',
                         n_iter=2,
                         verbose=1,
                         n_jobs=-1,
                         cv=[(slice(None), slice(None))])  # one "split" that trains and scores on all of Xtrain

clf.fit(Xtrain, ytrain)
print("Best parameters:", clf.best_params_)
print("Lowest loss: ", (-clf.best_score_))

Finally, we'll make a model using the best results from the parameter search:

model = xgb.XGBClassifier(use_label_encoder=False, scale_pos_weight=6, subsample=0.7, reg_lambda=5.0,
                          reg_alpha=0, n_estimators=500, min_child_weight=5, max_depth=10, max_delta_step=0.3,
                          learning_rate=0.3, gamma=0, colsample_bytree=0.89, colsample_bylevel=0.7, random_state=20)
model.fit(Xtrain, ytrain)
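
Before deploying, we can sanity-check the model against the held-out test set (a minimal sketch; your exact numbers will depend on the split):

from sklearn.metrics import accuracy_score, classification_report

test_preds = model.predict(Xtest)
print("Test accuracy:", accuracy_score(ytest, test_preds))
print(classification_report(ytest, test_preds, target_names=start_position_encoder.classes_))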

Time to deploy our model to Modelbit!

Model deployment

We'll deploy our model with the associated helper functions and data. We'll wrap the model in a deploy function called position_predictor. The input data will be a JSON object that we'll convert into a dataframe:

# Fix the column order in the df, as some models/transforms may be sensitive to order
df_col_order = list(df.drop(["START_POSITION"], axis=1).columns)

def position_predictor(data):
    df = pd.DataFrame.from_dict([data])[df_col_order]
    clean_data(df)
    df_predict = apply_features(df)
    prediction = model.predict(df_predict)
    return start_position_encoder.inverse_transform(prediction)[0]
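
We can try the function locally before deploying (the input values below are hypothetical, and the AVG_GAME_POINTS key assumes the column name from the merged dataset above):

position_predictor({"TEAM_ABBR": "LAL", "MINUTES_PLAYED": "31:00", "PTS": 18, "AVG_GAME_POINTS": 10.2})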

mb.deploy(position_predictor)

Modelbit will package position_predictor and its dependencies, including the functions clean_data and apply_features and the values team_encoder and start_position_encoder, into a deployment that can then be called from REST, Snowflake, and Redshift.
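
For example, a REST call could look like the sketch below. This is illustrative only: the workspace URL and exact request shape come from the API endpoints page in your Modelbit workspace, and the input values are hypothetical.

import requests

response = requests.post(
    "https://<your-workspace>.app.modelbit.com/v1/position_predictor/latest",  # hypothetical URL pattern
    json={"data": {"TEAM_ABBR": "LAL", "MINUTES_PLAYED": "31:00", "PTS": 18, "AVG_GAME_POINTS": 10.2}},
)
print(response.json())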