Creating jobs from a notebook

Within a Python notebook, define a function that trains a model and stores it in the model registry. Then create the job in Modelbit with mb.add_job(...):

from sklearn import linear_model

def train():
    lm = linear_model.LinearRegression()
    lm.fit([[1], [2], [3]], [2, 4, 6])
    mb.add_model("example_model", lm)

mb.add_job(train, deployment_name="training_example")

This will create a new deployment called training_example, and create a job called train within it. Click View in Modelbit to see the job. In the Source Code tab you'll also see the code used to define this job, copied from your notebook.

Return to the Jobs tab for train and then click Run Now to run the job. Once the job finishes you can fetch the results of the job in your notebook with mb.get_model:

mb.get_model("example_model")

This call returns the linear regression that was created after running the train job. Read on for how to use the results of training jobs in inference functions, as well as how to schedule and parameterize your jobs.
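To make concrete what the model from train has learned, here is a dependency-free sketch of one-variable ordinary least squares on the same three training points. The helper fit_line is hypothetical and written only for illustration; it is not part of scikit-learn or Modelbit, and scikit-learn's own solver works differently internally.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Same data as the train example: inputs 1, 2, 3 and targets 2, 4, 6
slope, intercept = fit_line([1, 2, 3], [2, 4, 6])
print(slope, intercept)       # 2.0 0.0
print(slope * 5 + intercept)  # 10.0
```

Because the targets are exactly twice the inputs, the fitted line has slope 2 and intercept 0, so a prediction for 5 comes out as 10.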

Training job examples

Now that you've created your first job, let's look at some more advanced use cases. These training jobs take advantage of the model registry.

Retraining a model used in an inference function

In this example, we'll create an example deployment that uses a linear regression to double numbers. Then we'll use a job to retrain and redeploy the linear model used for doubling numbers.

First we'll define a function to train and store our model called train_doubler. Then we'll use the model in an inference function called predict_double:

from sklearn import linear_model
import time

def train_doubler():
    lm = linear_model.LinearRegression()
    # we're using time.time() to mimic dynamic training data in this example
    lm.fit([[1], [2], [time.time()]], [2, 4, time.time() * 2])
    mb.add_model("doubler_model", lm)

train_doubler()

def predict_double(number: int) -> int:
    doubler_model = mb.get_model("doubler_model")
    return doubler_model.predict([[number]])[0]

predict_double(5)

Then we deploy the inference function predict_double:

import sklearn

mb.deploy(predict_double, python_packages=[f"scikit-learn=={sklearn.__version__}"])

At this point Modelbit knows about the inference function and has created API endpoints for it, but Modelbit doesn't have a job to retrain the model.

We'll add our training job train_doubler to our deployment. Whenever the job runs, it'll update doubler_model in the registry, which will be used by predict_double for inferences.

mb.add_job(train_doubler, deployment_name="predict_double")

Click Run Now in the job's page in Modelbit to update doubler_model.

Retraining with refreshed data on a schedule

Like the example above, we'll create a training function and a deployment function. Adding to the above, we'll call mb.get_dataset(...) in our training function to fetch a dataset from Modelbit that we'll use for training. When we send the job to Modelbit we'll configure it to refresh the dataset before running the job every night.

First, the training function train_lead_scorer and deployment function score_lead:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_lead_scorer():
    training_data = mb.get_dataset("leads")
    X = training_data.drop('CONVERTED', axis=1)
    y = training_data['CONVERTED']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    ...
    model = make_pipeline(pipeline, LogisticRegression(max_iter=1000, random_state=42))
    model.fit(X_train, y_train)
    mb.add_model("lead_scoring_model", model)


def score_lead(hdyhau: str, utm_source: str, industry: str) -> float:
    df = pd.DataFrame.from_records([{
        "HDYHAU": hdyhau,
        "UTM_SOURCE": utm_source,
        "INDUSTRY": industry
    }])
    lead_scoring_model = mb.get_model("lead_scoring_model")
    return lead_scoring_model.predict_proba(df)[0][1]

score_lead("email", "google", "Entertainment")

Then we'll deploy score_lead:

mb.deploy(score_lead)

Lastly we create the training job, which will run automatically every night:

mb.add_job(train_lead_scorer,
           deployment_name="score_lead",  # to add the training job to the `score_lead` deployment
           schedule="daily",  # to run the training job every night
           refresh_datasets=["leads"])  # to refresh the `leads` dataset before executing the training job

Running jobs with arguments

Training jobs can accept arguments, which can be useful for testing different training parameters. This example changes the max_iter parameter of a LogisticRegression using an argument to the training job:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def train_iris_lr(max_training_iter: int):
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=max_training_iter).fit(X, y)
    mb.add_model("iris_model", clf)

mb.add_job(train_iris_lr,
           deployment_name="training_example",  # adding a second job to the first example
           default_arguments=[500])  # default arguments can be overridden later

Above, the default_arguments parameter lets us send 500 as the max_training_iter argument to train_iris_lr. If the training function accepted multiple arguments, we'd send multiple values in the default_arguments list, in the same order as the function's parameters.

The default_arguments parameter must be a list of numbers or strings, and there should be one item in the list for each parameter into the training function.
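The rule above can be checked locally before sending a job to Modelbit. The following is a minimal sketch using only the standard library; check_default_arguments is a hypothetical helper written for this example, not a Modelbit function, and the exact validation Modelbit performs may differ.

```python
import inspect


def check_default_arguments(func, default_arguments):
    """Return a parameter-to-value mapping, or raise ValueError if the list does not fit."""
    params = list(inspect.signature(func).parameters)
    if len(default_arguments) != len(params):
        raise ValueError(
            f"{func.__name__} takes {len(params)} parameter(s) "
            f"but {len(default_arguments)} default argument(s) were given"
        )
    for name, value in zip(params, default_arguments):
        # Only numbers and strings are allowed, per the rule stated above
        if not isinstance(value, (int, float, str)):
            raise ValueError(f"default for {name!r} must be a number or a string")
    return dict(zip(params, default_arguments))


def train_iris_lr(max_training_iter: int):
    ...  # training body omitted; only the signature matters for this check


print(check_default_arguments(train_iris_lr, [500]))
```

Passing [500, 1000] or [[500]] for this one-parameter function raises a ValueError, which catches a mismatch in the notebook rather than at job run time.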

Running a job and waiting for its results

In addition to creating jobs from the notebook, you can also run jobs and fetch their results. We'll run the train_iris_lr job from the previous example:

job_run_request = mb.run_job(deployment_name="training_example", job_name="train_iris_lr")
job_run_request.wait() # will block until job completes

# then fetch the updated LogisticRegression
mb.get_model("iris_model")

The above job ran with the default argument of 500. We can also run it with a different argument, in this case 1000:

job_run_request = mb.run_job(deployment_name="training_example", job_name="train_iris_lr", arguments=[1000])
job_run_request.wait()

# then fetch the updated LogisticRegression
mb.get_model("iris_model")
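When trying several values, the run-and-wait pattern above can be wrapped in a loop. This is a sketch under stated assumptions: run_with_each is a hypothetical helper, and FakeModelbit and FakeRun are stand-ins so the snippet runs without a Modelbit session. Only the mb.run_job(...) keyword arguments and the wait() call shown above are used.

```python
def run_with_each(mb, deployment_name, job_name, argument_sets):
    """Run the job once per argument list, waiting for each run to finish."""
    completed = []
    for arguments in argument_sets:
        run = mb.run_job(deployment_name=deployment_name,
                         job_name=job_name,
                         arguments=arguments)
        run.wait()  # blocks until this run completes
        completed.append(arguments)
    return completed


# Stand-ins for a dry run; replace fake_mb with your logged-in `mb` in a notebook.
class FakeRun:
    def wait(self):
        pass


class FakeModelbit:
    def __init__(self):
        self.calls = []

    def run_job(self, deployment_name, job_name, arguments=None):
        self.calls.append((deployment_name, job_name, arguments))
        return FakeRun()


fake_mb = FakeModelbit()
run_with_each(fake_mb, "training_example", "train_iris_lr", [[500], [1000], [2000]])
print(fake_mb.calls)
```

Since train_iris_lr stores its result under the same name on every run, mb.get_model("iris_model") would presumably return the model from the last run in the loop.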

Additional parameters for mb.add_job

There are several parameters you can use to customize the behavior of your job.

Tip: For a full list of parameters to mb.add_job, check out the API Reference.

schedule="cron-string"

Modelbit jobs can be run on any schedule you can define with a cron string. You can also use the simpler schedules of hourly, daily, weekly and monthly:

mb.add_job(my_job, deployment_name="training_example", schedule="daily")
# or
mb.add_job(my_job, deployment_name="training_example", schedule="0 0 * * *")
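For readers unfamiliar with cron strings, the sketch below labels each field of a schedule such as "0 0 * * *". It assumes the conventional five-field cron layout (minute, hour, day of month, month, day of week), which matches the example above; describe_cron is a hypothetical helper for illustration and is not part of Modelbit.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(cron_string):
    """Split a five-field cron string into a field-name-to-value mapping."""
    parts = cron_string.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron("0 0 * * *"))
```

Here minute and hour are both 0 and the remaining fields are wildcards, so the schedule fires once a day at midnight.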

refresh_datasets=["dataset-name"]

Jobs usually require fresh data to retrain their models. Using the refresh_datasets parameter tells Modelbit to refresh the datasets used by the job before executing the job:

mb.add_job(my_job, deployment_name="training_example", refresh_datasets=["leads"])

size="size"

If your job requires more CPU or RAM than the default job runner provides, use a larger runner. Set the size parameter to one of the sizes from the runner sizes table:

mb.add_job(my_job, deployment_name="training_example", size="medium")

email_on_failure="your-email"

Modelbit can email you if your job fails. Just set the email_on_failure parameter to your email address:

mb.add_job(my_job, deployment_name="training_example", email_on_failure="you@company.com")
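These parameters can be collected in one place and passed to a single mb.add_job call. The sketch below builds the keyword arguments as a dictionary and leaves out anything not set. build_job_options is a hypothetical helper, and combining all four parameters in one call is an assumption based on the earlier example that combines schedule and refresh_datasets.

```python
def build_job_options(deployment_name, schedule=None, refresh_datasets=None,
                      size=None, email_on_failure=None):
    """Collect mb.add_job keyword arguments, omitting any left unset."""
    options = {
        "deployment_name": deployment_name,
        "schedule": schedule,
        "refresh_datasets": refresh_datasets,
        "size": size,
        "email_on_failure": email_on_failure,
    }
    return {key: value for key, value in options.items() if value is not None}


options = build_job_options(
    "training_example",
    schedule="daily",
    refresh_datasets=["leads"],
    email_on_failure="you@company.com",
)
print(options)
# In a notebook with a Modelbit session: mb.add_job(my_job, **options)
```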