Creating jobs from a notebook
Within a Python notebook, define a function that trains a model and stores it in the model registry. Then create the job in Modelbit with mb.add_job(...):
from sklearn import linear_model

# assumes you've already authenticated, e.g. with mb = modelbit.login()
def train():
    lm = linear_model.LinearRegression()
    lm.fit([[1], [2], [3]], [2, 4, 6])
    mb.add_model("example_model", lm)

mb.add_job(train, deployment_name="training_example")
This will create a new deployment called training_example, and create a job called train within it. Click View in Modelbit to see the job. In the Source Code tab you'll also see the code used to define this job, copied from your notebook.
Return to the Jobs tab for train and then click Run Now to run the job. Once the job finishes you can fetch its results in your notebook with mb.get_model:
mb.get_model("example_model")
This call returns the linear regression that was created by running the train job. Read on for how to use the results of training jobs in inference functions, as well as how to schedule and parameterize your jobs.
Training job examples
Now that you've created your first job, let's look at some more advanced use cases. These training jobs take advantage of the model registry.
Retraining a model used in an inference function
In this example, we'll create an example deployment that uses a linear regression to double numbers. Then we'll use a job to retrain and redeploy the linear model used for doubling numbers.
First we'll define a function called train_doubler to train and store our model. Then we'll use the model in an inference function called predict_double:
from sklearn import linear_model
import time

def train_doubler():
    lm = linear_model.LinearRegression()
    # we're using time.time() to mimic dynamic training data in this example
    lm.fit([[1], [2], [time.time()]], [2, 4, time.time() * 2])
    mb.add_model("doubler_model", lm)

train_doubler()

def predict_double(number: int) -> int:
    doubler_model = mb.get_model("doubler_model")
    return doubler_model.predict([[number]])[0]

predict_double(5)
Then we deploy the inference function predict_double:
import sklearn
mb.deploy(predict_double, python_packages=[f"scikit-learn=={sklearn.__version__}"])
At this point Modelbit knows about the inference function and has created API endpoints for it, but Modelbit doesn't have a job to retrain the model.
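As an aside, here's a sketch of what calling one of those endpoints looks like from Python with the requests library. The workspace name in the URL is a placeholder, and the [[id, input], ...] batch request shape follows Modelbit's documented convention; check the deployment's API Endpoints page in Modelbit for your exact URL:

import requests

# hypothetical URL: replace "your-workspace" with your actual workspace name
response = requests.post(
    "https://your-workspace.app.modelbit.com/v1/predict_double/latest",
    json={"data": [[1, 5]]},  # batch format: each inner list is [row_id, input]
)
print(response.json())  # e.g. {"data": [[1, 10.0]]}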
We'll add our training job train_doubler to our deployment. Whenever the job runs, it'll update doubler_model in the registry, which will be used by predict_double for inferences.
mb.add_job(train_doubler, deployment_name="predict_double")
Click Run Now on the job's page in Modelbit to update doubler_model.
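You can also trigger the same run from your notebook instead of the UI, using mb.run_job (covered in more detail below):

mb.run_job(deployment_name="predict_double", job_name="train_doubler")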
Retraining with refreshed data on a schedule
Like the example above, we'll create a training function and a deployment function. This time we'll call mb.get_dataset(...) in our training function to fetch a dataset from Modelbit that we'll use for training. When we send the job to Modelbit, we'll schedule it to run every night and to refresh the dataset before each run.
First, the training function train_lead_scorer and deployment function score_lead:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

def train_lead_scorer():
    training_data = mb.get_dataset("leads")
    X = training_data.drop('CONVERTED', axis=1)
    y = training_data['CONVERTED']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    ...  # elided in the original; presumably builds the `pipeline` preprocessing step used below
    model = make_pipeline(pipeline, LogisticRegression(max_iter=1000, random_state=42))
    model.fit(X_train, y_train)
    mb.add_model("lead_scoring_model", model)

def score_lead(hdyhau: str, utm_source: str, industry: str) -> float:
    df = pd.DataFrame.from_records([{
        "HDYHAU": hdyhau,
        "UTM_SOURCE": utm_source,
        "INDUSTRY": industry
    }])
    lead_scoring_model = mb.get_model("lead_scoring_model")
    return lead_scoring_model.predict_proba(df)[0][1]  # probability of the positive (converted) class

score_lead("email", "google", "Entertainment")
Then we'll deploy score_lead:
mb.deploy(score_lead)
Lastly we create the training job, which will run automatically every night:
mb.add_job(train_lead_scorer,
           deployment_name="score_lead",  # to add the training job to the `score_lead` deployment
           schedule="daily",  # to run the training job every night
           refresh_datasets=["leads"])  # to refresh the `leads` dataset before executing the training job
Running jobs with arguments
Training jobs can accept arguments, which can be useful for testing different training parameters. This example changes the max_iter parameter of a LogisticRegression using an argument to the training job:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def train_iris_lr(max_training_iter: int):
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=max_training_iter).fit(X, y)
    mb.add_model("iris_model", clf)

mb.add_job(train_iris_lr,
           deployment_name="training_example",  # adding a second job to the first example
           default_arguments=[500])  # default arguments can be overridden later
Above, the default_arguments parameter lets us send 500 as the max_training_iter argument to train_iris_lr. If the training function accepted multiple arguments, we'd send multiple values in the list of default_arguments.
The default_arguments parameter must be a list of numbers or strings, with one item in the list for each parameter of the training function.
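As an illustration, here's a sketch of a job with two parameters. The function train_iris_lr_v2 and its regularization_c parameter are hypothetical, not part of the example above:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# hypothetical two-parameter variant of the trainer above
def train_iris_lr_v2(max_training_iter: int, regularization_c: float):
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=max_training_iter, C=regularization_c).fit(X, y)
    mb.add_model("iris_model", clf)

mb.add_job(train_iris_lr_v2,
           deployment_name="training_example",
           default_arguments=[500, 1.0])  # one value per parameter, in order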
Running a job and waiting for its results
In addition to creating jobs from the notebook, you can also run jobs and fetch their results. We'll run the train_iris_lr job from the previous example:
job_run_request = mb.run_job(deployment_name="training_example", job_name="train_iris_lr")
job_run_request.wait() # will block until job completes
# then fetch the updated LogisticRegression
mb.get_model("iris_model")
The above job ran with the default argument of 500. We can also run it with different arguments, in this case 1000:
job_run_request = mb.run_job(deployment_name="training_example", job_name="train_iris_lr", arguments=[1000])
job_run_request.wait()
# then fetch the updated LogisticRegression
mb.get_model("iris_model")
Additional parameters for mb.add_job
There are several parameters you can use to customize the behavior of your job.
For a full list of parameters to mb.add_job, check out the API Reference.
schedule="cron-string"
Modelbit jobs can be run on any schedule you can define with a cron string. You can also use the simpler schedules of hourly, daily, weekly and monthly:
mb.add_job(my_job, deployment_name="training_example", schedule="daily")
# or
mb.add_job(my_job, deployment_name="training_example", schedule="0 0 * * *")
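Any standard five-field cron string works. For example, to retrain every Monday at 2:30am (my_job stands in for your training function, as above):

mb.add_job(my_job, deployment_name="training_example", schedule="30 2 * * 1")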
refresh_datasets=["dataset-name"]
Jobs usually require fresh data to retrain their models. Using the refresh_datasets parameter tells Modelbit to refresh the datasets used by the job before executing the job:
mb.add_job(my_job, deployment_name="training_example", refresh_datasets=["leads"])
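Since refresh_datasets takes a list, a job that trains on several datasets can refresh all of them before it runs. In this sketch, accounts is a hypothetical second dataset:

mb.add_job(my_job, deployment_name="training_example", refresh_datasets=["leads", "accounts"])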
size="size"
If your job requires more CPU or RAM than the default job runner provides, use a larger runner. Set the size parameter to one of the sizes from the runner sizes table:
mb.add_job(my_job, deployment_name="training_example", size="medium")
email_on_failure="your-email"
Modelbit can email you if your job fails. Just set the email_on_failure parameter to your email address:
mb.add_job(my_job, deployment_name="training_example", email_on_failure="you@company.com")