Example XGBoost deployment
In this example we'll train an xgboost.XGBClassifier to predict the position of a basketball player based on the number of points they scored and how long they were on the court. Predicted positions can be NONE, C (Center), F (Forward), or G (Guard).
We'll also structure our notebook so that the feature engineering and encoders we use for training can be shared with the deployment in production.
First, in your notebook, import and log in to Modelbit:
import modelbit
mb = modelbit.login()
In Modelbit, make two new datasets based on the NBA games data. The first, nba_game_stats:
select TEAM_ABBR, START_POSITION, MINUTES_PLAYED, PTS
from NBA_GAME_DETAILS
And the second, nba_average_game_points:
select TEAM_ABBR, avg(PTS) as avg_game_points
from NBA_GAME_DETAILS group by 1
Then we'll download both to our notebook and join them for our training data:
import pandas as pd
df_details = mb.get_dataset("nba_game_stats")
df_summary = mb.get_dataset("nba_average_game_points")
df = pd.merge(df_details, df_summary, how="left", on=["TEAM_ABBR"])
del df_details, df_summary
df = df.head(5_000) # limit dataset for quick perf on laptops
df
Now that we have our data ready, let's start with cleaning and feature engineering:
Data cleaning & feature engineering
We'll scrub the data to remove nulls, fix formatting, etc. We're using a function so we can reuse this cleaning logic in the deployment:
def clean_data(input_df):
    if "START_POSITION" in input_df:  # production calls to the deployment won't have the field we're predicting
        input_df["START_POSITION"] = input_df["START_POSITION"].fillna("NONE")
    input_df["PTS"] = input_df["PTS"].fillna(0)
    # MINUTES_PLAYED arrives as "MM:SS" strings; keep just the whole minutes as an int
    input_df["MINUTES_PLAYED"] = input_df["MINUTES_PLAYED"].fillna("0").apply(lambda x: int(str(x).split(":")[0]))
clean_data(df)
df
We'll collect the summary stats and one-hot encodings needed for feature transformations. The deployment won't have access to the training data, so it can't learn the full set of team names for one-hot encoding on its own. Instead, we fit a OneHotEncoder now that we can reuse in the deployment. By storing this helper data in variables, instead of applying it directly to the dataframe, we'll be able to use it again later.
from sklearn.preprocessing import OneHotEncoder
team_encoder = OneHotEncoder(handle_unknown='ignore')
team_encoder.fit(df[["TEAM_ABBR"]])
team_encoder.categories_
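Because we passed handle_unknown='ignore', any team the encoder didn't see during training encodes to all zeros. A quick check (the team code XYZ below is a made-up placeholder):
team_encoder.transform(pd.DataFrame({"TEAM_ABBR": ["DAL", "XYZ"]})).toarray()
# The "DAL" row has a single 1 in its team's column; the unknown "XYZ" row is all zeros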
Now we'll apply feature engineering to the data frame. Again, we'll use a function so that we can share this logic with the deployment in production.
import numpy as np
def apply_features(input_df):
    # Set the TEAM_ABBR column as categorical so unknown teams one-hot to all zeros
    input_df["TEAM_ABBR"] = input_df["TEAM_ABBR"].astype(
        pd.api.types.CategoricalDtype(team_encoder.categories_[0]))
    output_df = pd.concat([  # Concat one-hot'd columns as bools with the rest of the features
        input_df.drop(["TEAM_ABBR"], axis=1),
        pd.get_dummies(input_df["TEAM_ABBR"], prefix="TEAM_ABBR").astype(bool)
    ], axis=1)
    # We can make more features here as well
    output_df["PLAYED_AT_ALL"] = (output_df["MINUTES_PLAYED"] > 0)
    output_df["POINTS_PER_MINUTE"] = (
        output_df["PTS"] / output_df["MINUTES_PLAYED"]
    ).replace([np.inf, -np.inf], np.nan).fillna(0)
    return output_df
df_model = apply_features(df)
display(df_model)
df_model.dtypes
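XGBoost expects numeric or boolean inputs, so before training it's worth confirming that only the label column is still a string (a quick check):
# Every feature column should be numeric or bool; only START_POSITION remains an object/string column
df_model.drop(["START_POSITION"], axis=1).dtypes.value_counts()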
Our data is ready for training.
Training the XGBoost model
First, we'll create the train/test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
start_position_encoder = LabelEncoder()
X = df_model.drop(["START_POSITION"], axis=1)
y = start_position_encoder.fit_transform(df_model["START_POSITION"])
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7, random_state=42)
y[:100], start_position_encoder.classes_
Then we'll search the parameter space with RandomizedSearchCV:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
params = {'max_depth': [5, 10],
          'learning_rate': [0.1, 0.3],
          'subsample': np.arange(0.7, 1.0, 0.1),
          'colsample_bytree': np.arange(0.7, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.7, 1.0, 0.1),
          'n_estimators': [3, 5, 10],
          'max_delta_step': [0, 0.3, 0.7],
          'min_child_weight': [1, 3, 5],
          'gamma': [0, 0.1, 0.3],
          'reg_lambda': [0.5, 3.0, 5.0],
          'reg_alpha': [0, 0.1, 2]}
xgbr = xgb.XGBClassifier(seed=42)
clf = RandomizedSearchCV(estimator=xgbr,
                         param_distributions=params,
                         scoring='neg_log_loss',
                         n_iter=2,
                         verbose=1,
                         n_jobs=-1,
                         # A single "fold" that trains and scores on all of Xtrain keeps this demo search fast
                         cv=[(slice(None), slice(None))])
clf.fit(Xtrain, ytrain)
print("Best parameters:", clf.best_params_)
print("Lowest loss: ", (-clf.best_score_))
Finally, we'll make a model using the best results from the parameter search:
model = xgb.XGBClassifier(use_label_encoder=False, scale_pos_weight=6, subsample=0.7, reg_lambda=5.0,
                          reg_alpha=0, n_estimators=500, min_child_weight=5, max_depth=10, max_delta_step=0.3,
                          learning_rate=0.3, gamma=0, colsample_bytree=0.89, colsample_bylevel=0.7, random_state=20)
model.fit(Xtrain, ytrain)
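Before deploying, it's worth a quick sanity check on the held-out test set (a minimal sketch using sklearn's accuracy_score; your exact numbers will vary):
from sklearn.metrics import accuracy_score

# Score the final model on the 30% holdout from train_test_split
preds = model.predict(Xtest)
print("Test accuracy:", accuracy_score(ytest, preds))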
Time to deploy our model to Modelbit!
Model deployment
We'll deploy our model with the associated helper functions and data. We'll wrap the model in a deploy function, and give it a unit test. The input data will be a JSON object that we'll convert into a dataframe:
# Fix the column order in the df, as some models/transforms may be sensitive to order
df_col_order = list(df.drop(["START_POSITION"], axis=1).columns)

def position_predictor(data):
    """
    >>> position_predictor({ "TEAM_ABBR": "DAL", "MINUTES_PLAYED": 30, "PTS": 15, "AVG_GAME_POINTS": 10.1 })
    'G'
    """
    df = pd.DataFrame.from_dict([data])[df_col_order]
    clean_data(df)
    df_predict = apply_features(df)
    prediction = model.predict(df_predict)
    return start_position_encoder.inverse_transform(prediction)[0]
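We can call the function locally first to make sure the whole pipeline round-trips (the position you get back depends on your trained model):
position_predictor({ "TEAM_ABBR": "DAL", "MINUTES_PLAYED": 30, "PTS": 15, "AVG_GAME_POINTS": 10.1 })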
mb.deploy(position_predictor)
Modelbit will then run the unit tests and package position_predictor and its dependencies, including the functions clean_data and apply_features and the values team_encoder and start_position_encoder, into a deployment that can be called from REST, Snowflake, and Redshift.
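For example, a REST call might look like the following sketch. The exact endpoint URL comes from your deployment's page in Modelbit; workspace-name below is a placeholder, and the batch format (an id paired with the input dict) is an assumption to verify against your dashboard:
import json
import requests

# Hypothetical URL; copy the real one from the deployment's API endpoints page
url = "https://workspace-name.app.modelbit.com/v1/position_predictor/latest"
payload = {"data": [[1, {"TEAM_ABBR": "DAL", "MINUTES_PLAYED": 30, "PTS": 15, "AVG_GAME_POINTS": 10.1}]]}
print(requests.post(url, data=json.dumps(payload)).json())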