get_dataset

Fetch a dataset and return it as a Pandas DataFrame. Datasets can be filtered and used as feature stores.

Parameters

mb.get_dataset({dataset_name}, ...)
  • dataset_name: str The name of the dataset.
  • filters: Optional[Dict[str, ...]] If supplied, the returned DataFrame contains only the rows matching the filter criteria. See "Filter syntax" below for the formats the filters can take.

Filter syntax

Dataset filter syntax supports several condition formats:

  • Single-value equivalence: { "my_column": 4 }. Returns all rows where my_column = 4.
  • Multiple-value equivalence: { "my_column": ["a", "b", "c"] }. Returns all rows where my_column is either "a", "b", or "c".
  • Comparison operators: { "my_column": { "<": 4 }}. Returns all rows where my_column < 4. Available operators:
    • <: Less than
    • <=: Less than or equal to
    • >: Greater than
    • >=: Greater than or equal to
    • =: Equals
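
To make the three condition formats concrete, here is a minimal pure-Python sketch of how such a filters dict could be evaluated against rows. It is not Modelbit's implementation: the row_matches helper and the sample rows are hypothetical, and it assumes that conditions on different columns, and multiple operators on one column, must all hold at once (as the examples further down suggest).

```python
import operator

# Operators listed in the filter syntax, mapped to Python comparisons.
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def row_matches(row, filters):
    """Hypothetical helper: True if `row` (a dict) satisfies every filter."""
    for column, condition in filters.items():
        value = row[column]
        if isinstance(condition, dict):
            # Operator form: every operator in the dict must hold.
            if not all(_OPS[op](value, bound) for op, bound in condition.items()):
                return False
        elif isinstance(condition, (list, tuple, set)):
            # Multiple-value form: the value must be one of the listed values.
            if value not in condition:
                return False
        elif value != condition:
            # Single-value form: plain equality.
            return False
    return True


# Made-up sample rows, for illustration only.
rows = [
    {"REGION": "NA", "DWELL_TIME": 3},
    {"REGION": "SA", "DWELL_TIME": 50},
    {"REGION": "EU", "DWELL_TIME": 50},
]
filters = {"REGION": ["NA", "SA"], "DWELL_TIME": {">": 5, "<=": 100}}
matching = [r for r in rows if row_matches(r, filters)]
```

Only the second sample row passes both conditions, so matching holds that single row.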

Returns

pandas.DataFrame

Examples

Get all rows in a dataset

Returns all rows in the customer_features dataset.

similar_customers = mb.get_dataset("customer_features")

Get rows matching a certain value

Returns the row(s) in customer_features where CUSTOMER_ID=52.

similar_customers = mb.get_dataset("customer_features", filters={"CUSTOMER_ID": 52 })

Get specific rows

Returns all rows in customer_features where the REGION column is either NA or SA and the EMPLOYEE_COUNT column is either 100-500 or 500-5000.

similar_customers = mb.get_dataset(
    "customer_features",
    filters={
        "REGION": ["NA", "SA"],
        "EMPLOYEE_COUNT": ["100-500", "500-5000"],
    },
)

Get rows greater or less than a value

Returns events where DWELL_TIME is greater than, or less than or equal to, a given value.

# using greater than
events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { ">": 5 } } )

# using less than or equal to
events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { "<=": 100 } } )

Get rows between a range of values

Returns events with DWELL_TIME greater than 5 and less than or equal to 100. Both operators in the dict must hold for a row to be returned:

events = mb.get_dataset("website_events", filters={ "DWELL_TIME": { ">": 5, "<=": 100 } } )
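
If range filters are built in several places, a small helper can keep the inclusive and exclusive bounds explicit. The sketch below is only an illustration: range_filter is a hypothetical function, not part of the Modelbit API, and it produces nothing more than the operator dict described under "Filter syntax".

```python
def range_filter(low, high, include_low=False, include_high=True):
    """Hypothetical helper: build an operator dict for low..high."""
    low_op = ">=" if include_low else ">"
    high_op = "<=" if include_high else "<"
    return {low_op: low, high_op: high}


# Same condition as the example above: DWELL_TIME > 5 and DWELL_TIME <= 100.
filters = {"DWELL_TIME": range_filter(5, 100)}

# The dict would then be passed along unchanged, for example:
# events = mb.get_dataset("website_events", filters=filters)
```

The defaults mirror the example above (exclusive lower bound, inclusive upper bound); pass include_low=True or include_high=False to change either end.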

See also