Rate limiting inferences

Modelbit runs your inferences rapidly across autoscaling hardware to minimize latency and maximize throughput.

However, maximum throughput is not always what you want. If your deployments call external APIs, those APIs may impose their own rate limits, which can surface as errors in your deployment.

Modelbit's rate limit feature lets you slow down the rate of inferences sent to your deployment, so that calls to external APIs stay under the limits those APIs impose.
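When a synchronous call exceeds a deployment's rate limit, Modelbit rejects it with a 429 status code, so callers may want to retry with backoff. A minimal sketch, assuming a `send` callable (hypothetical, not part of Modelbit's client library) that invokes your deployment's endpoint and returns an `(http_status, body)` pair:

```python
import time

def call_with_backoff(send, payload, max_retries=5, base_delay=1.0):
    # `send` is a stand-in for however you invoke your deployment's REST
    # endpoint; it must return an (http_status, body) pair. This helper is
    # hypothetical, for illustration only.
    delay = base_delay
    status, body = send(payload)
    for _ in range(max_retries):
        if status != 429:  # anything but 429 means the limit didn't reject us
            break
        time.sleep(delay)            # wait out the current calendar minute
        delay = min(delay * 2, 60)   # exponential backoff, capped at a minute
        status, body = send(payload)
    return status, body
```

Because limits reset each calendar minute, a capped exponential backoff like this usually clears within a minute or two.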

Creating a rate limit

To create a rate limit, head to Settings and choose Rate Limits, then click New Rate Limit.

Specify a name for your rate limit and the maximum inferences allowed per minute.

Apply a rate limit to a deployment

In the Environment tab of a deployment, choose the rate limit you want applied to this version. The rate limit will activate within one minute.

To remove a rate limit, change the limit to Off in the Environment tab.

Using rate limits

Rate limits are global. All deployments and versions across all branches that use the same named limit share one quota. If two deployments use the same rate limit, inferences sent to either deployment count against that limit.

Limit consumption is calculated per calendar minute. If your rate limit is 100, only the first 100 inferences received in a calendar minute are sent to your deployment to run. The remainder are queued (if using the async API) or rejected with a 429 status code.
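The per-calendar-minute accounting can be sketched as a small simulation (illustrative only, not Modelbit's actual implementation), where timestamps are inference arrival times in seconds:

```python
from collections import defaultdict

def apply_limit(limit_per_minute, timestamps):
    # The first `limit_per_minute` arrivals in each calendar minute run;
    # the rest are rejected with 429 (or queued on the async API).
    used = defaultdict(int)  # calendar minute -> inferences consumed
    outcomes = []
    for ts in timestamps:
        minute = int(ts // 60)
        if used[minute] < limit_per_minute:
            used[minute] += 1
            outcomes.append("run")
        else:
            outcomes.append("429")
    return outcomes
```

Note that the quota resets at each calendar-minute boundary, not on a rolling 60-second window.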

Batches of inferences are kept together and not split across calendar minutes. So if your rate limit is 100, and you send two batches of 75 inferences, the first batch will run in the first minute and the second batch will run in the second minute.
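The batch behavior above can be sketched the same way (a hypothetical illustration, not Modelbit's implementation): each batch lands in the first minute whose remaining quota can hold it whole.

```python
def schedule_batches(limit_per_minute, batch_sizes):
    # A batch is never split across calendar minutes: if it doesn't fit in
    # the current minute's remaining quota, it rolls over to the next minute.
    minute, remaining = 0, limit_per_minute
    placements = []  # minute index assigned to each batch
    for size in batch_sizes:
        if size > remaining:
            minute += 1                   # roll over to the next minute
            remaining = limit_per_minute  # quota resets each calendar minute
        placements.append(minute)
        remaining -= size
    return placements
```

With a limit of 100 and two batches of 75, the batches land in consecutive minutes, matching the example above.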