Using datasets as feature stores
Feature stores are datasets that give your models high-performance access to historical or aggregate data. In many cases, the client calling the deployment doesn't have all of the features that the model needs. Those extra features can be precalculated, stored in a feature store, and supplied at inference time.
For example, a transaction fraud model might need features about the historical fraud rate for a user. But the server calling the model only has the user's ID and the current transaction to evaluate. In this case, historical fraud rates could be calculated per user and stored in a dataset. Then, at inference time, the user's historical fraud rate could be fetched from the dataset:
```python
# Fetching the historical fraud features for a certain user
fraud_features = mb.get_dataset("historical_fraud_rates", filters={"USER_ID": 42})
```
Using feature stores
Using a dataset as a feature store only requires specifying `filters` when calling `get_dataset`. When filters are applied to a dataset, only the matching rows are returned.
Here's an example inference function that gets features from the dataset called `customer_features`:
```python
def my_deploy_function(customerId: str) -> float:
    # historical_features is a DataFrame filtered to rows matching the CUSTOMER_ID
    historical_features = mb.get_dataset("customer_features", filters={"CUSTOMER_ID": customerId})

    # Here you'd combine the features in historical_features with any other inputs or business logic
    all_features = historical_features + ...

    # Then return your inference
    return my_model.predict(all_features)
```
Performance of feature stores
Modelbit stores datasets in small shards so that fetching only a few rows from the dataset can be very fast. When used in a deployment, datasets are cached locally so that most lookups won't even traverse the network. This makes most lookups in a dataset nearly instant.
To improve filtering performance, set the first column in your dataset as the primary key you'll use when filtering. While you can filter datasets by multiple columns, datasets are indexed and sharded by their first column.
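To see why first-column filters are fast, here's an illustrative sketch of hash-based sharding in plain Python. This is not Modelbit's implementation; it only demonstrates the idea that when rows are grouped into shards keyed by the first column, a filter on that column reads one small shard instead of scanning every row:

```python
# Illustrative sketch (not Modelbit's implementation) of sharding a dataset
# by its first column so that lookups on that column touch only one shard.

NUM_SHARDS = 4

def shard_for(key) -> int:
    # Deterministically map a first-column value to a shard
    return hash(str(key)) % NUM_SHARDS

def build_shards(rows):
    # rows: tuples whose first element is the indexed column (e.g. USER_ID)
    shards = {i: [] for i in range(NUM_SHARDS)}
    for row in rows:
        shards[shard_for(row[0])].append(row)
    return shards

def filter_by_first_column(shards, value):
    # Only the single matching shard is scanned, not the whole dataset
    return [row for row in shards[shard_for(value)] if row[0] == value]

rows = [(42, 0.03), (7, 0.10), (42, 0.05), (99, 0.01)]
shards = build_shards(rows)
print(filter_by_first_column(shards, 42))  # both rows for user 42
```

Filtering on a column other than the first still works, but it has to scan every shard, which is why putting your primary lookup key first matters.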
Refreshing feature stores
Data stored in datasets won't change unless you refresh the dataset. You can refresh datasets in three ways:
- In the web app: Click the refresh button in the datasets table
- In a training job: Training jobs can refresh datasets before they run their training code
- Using a webhook: You can refresh datasets programmatically by calling an API
After refreshing a dataset, the updates will propagate to running deployments within a couple of minutes, and to training jobs the next time they start.
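As a sketch of the webhook option above, a refresh can be triggered with a standard HTTP POST. The URL and auth header below are placeholders, not Modelbit's actual refresh API; use the webhook URL and credentials from your Modelbit workspace:

```python
import json
import urllib.request

# Placeholder URL -- substitute the real webhook URL from your workspace
REFRESH_URL = "https://example.com/webhooks/refresh-dataset"

def refresh_dataset(dataset_name: str, api_key: str, opener=urllib.request.urlopen) -> bool:
    # `opener` is injectable so the HTTP call can be stubbed out in tests
    request = urllib.request.Request(
        REFRESH_URL,
        data=json.dumps({"dataset": dataset_name}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # hypothetical auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with opener(request, timeout=30) as response:
        return response.status == 200
```

A call like `refresh_dataset("customer_features", api_key)` could then be scheduled from a cron job or data pipeline so the feature store stays current.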