mb.get_dataset(dataset_name, ...)

Fetches a dataset and returns it as a pandas DataFrame. Datasets can optionally be filtered and used as feature stores.

Parameters

  • dataset_name: str The name of the dataset.
  • filters: Optional[Dict[str, ...]] If supplied with a filters dict, the DataFrame returned will be filtered to rows matching the filter criteria. See the next section for the formats that the filters can take.

Filter syntax

  • Single-value equivalence: { "my_column": 4 }. Returns all rows where my_column = 4.
  • Multiple-value equivalence: { "my_column": ["a", "b", "c"] }. Returns all rows where my_column is "a", "b", or "c".
  • Comparison operators: { "my_column": { "<": 4 }}. Returns all rows where my_column < 4. Available operators:
    • <: Less than
    • <=: Less than or equal to
    • >: Greater than
    • >=: Greater than or equal to
    • =: Equals
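
To make the three filter forms concrete, here is a minimal pure-Python sketch of the matching rules described above, applied to rows held as plain dicts. The helper row_matches and the sample rows are hypothetical and exist only for illustration; this is not how mb.get_dataset is implemented. It assumes that conditions on different columns, and multiple operators on one column, must all hold, as the examples on this page suggest.

```python
import operator

# Comparison operators listed in the filter syntax above.
_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def row_matches(row, filters):
    """Hypothetical helper: True if `row` (a dict) satisfies every filter entry."""
    for column, condition in filters.items():
        value = row[column]
        if isinstance(condition, dict):
            # Operator form: every listed comparison must hold.
            if not all(_OPERATORS[op](value, bound) for op, bound in condition.items()):
                return False
        elif isinstance(condition, list):
            # Multiple-value form: the value must be one of the listed values.
            if value not in condition:
                return False
        elif value != condition:
            # Single-value form: plain equality.
            return False
    return True


# Made-up sample rows, not real dataset contents.
rows = [
    {"REGION": "NA", "DWELL_TIME": 3},
    {"REGION": "SA", "DWELL_TIME": 50},
    {"REGION": "EU", "DWELL_TIME": 100},
]
filters = {"REGION": ["NA", "SA"], "DWELL_TIME": {">": 5, "<=": 100}}
matching = [row for row in rows if row_matches(row, filters)]
print(matching)
```

Only the second sample row passes both conditions: the first fails the DWELL_TIME bound and the third fails the REGION list.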

Returns

pandas.DataFrame

Examples

Get all rows in a dataset

Returns all rows in the customer_features dataset.

similar_customers = mb.get_dataset("customer_features")

Get rows matching a certain value

Returns the row(s) in customer_features where CUSTOMER_ID is 52.

similar_customers = mb.get_dataset("customer_features", filters={"CUSTOMER_ID": 52 })

Get specific rows

Returns all rows in customer_features where the REGION column is either NA or SA and the EMPLOYEE_COUNT column is either 100-500 or 500-5000.

similar_customers = mb.get_dataset(
    "customer_features",
    filters={
        "REGION": ["NA", "SA"],
        "EMPLOYEE_COUNT": ["100-500", "500-5000"],
    },
)

Get rows greater or less than a value

Returns rows in website_events where DWELL_TIME is above or below a threshold.

# using greater than
events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { ">": 5 } } )

# using less than or equal to
events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { "<=": 100 } } )

Get rows between a range of values

Returns events with DWELL_TIME greater than 5 and less than or equal to 100. Both conditions must hold:

events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { ">": 5, "<=": 100 } } )
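
If range filters are built in several places, a small helper can keep the bounds consistent. The sketch below is a hypothetical convenience function, not part of the mb API; it only builds the filters dict in the operator format documented above, which would then be passed to mb.get_dataset.

```python
def range_filter(column, low, high, include_low=False, include_high=True):
    """Hypothetical helper: build a filters dict bounding `column` between two values."""
    low_op = ">=" if include_low else ">"
    high_op = "<=" if include_high else "<"
    return {column: {low_op: low, high_op: high}}


# Same filter as the example above: DWELL_TIME > 5 and DWELL_TIME <= 100.
dwell_filter = range_filter("DWELL_TIME", 5, 100)
print(dwell_filter)
```

The result can be passed as filters=dwell_filter in the call shown above.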

See also