Rate limiting inferences

Modelbit runs your inferences rapidly across autoscaling hardware to minimize latency and maximize throughput.

The Modelbit platform enforces rate limits to manage the aggregate load of inferences across all deployments running on the platform. Rate limits apply to in-flight requests: any request that has been sent to Modelbit but has not yet completed counts against the limit. Once you've reached the limit, you'll be able to send more requests as soon as an in-flight request finishes.
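
Because capacity frees up as in-flight requests finish, a client can respond to a rate-limit error by backing off and retrying. Here's a minimal sketch, assuming a hypothetical workspace URL and assuming rate-limited responses carry an HTTP 429 status (check the error response your deployment actually returns):

```python
import time

import requests

# Hypothetical endpoint; substitute your workspace and deployment names.
URL = "https://your-workspace.modelbit.com/v1/example_deployment/latest"

def infer_with_backoff(payload, max_retries=5):
    """Send an inference request, backing off while rate limited."""
    for attempt in range(max_retries):
        response = requests.post(URL, json={"data": payload})
        # Assumption: rate-limited requests return HTTP 429. Waiting gives
        # in-flight requests time to complete, freeing capacity under the limit.
        if response.status_code == 429:
            time.sleep(2 ** attempt)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("Still rate limited after retries")
```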

RateLimitExceeded: Concurrency request count limit reached

If you get a Concurrency request count limit reached error, it means you've reached the limit on the total number of in-flight REST requests.

This concurrency limit applies to the total number of REST requests, not to the number of inferences within each request. If you're seeing this error, it may be because you're sending many REST requests with a single inference in each. Switching to the batch inference API can dramatically increase your throughput, since the batch API can handle thousands of inferences in a single REST request.
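
For example, instead of one request per inference, you can pack many inferences into a single POST body. A minimal sketch, assuming a hypothetical workspace URL and the [id, input] batch format from Modelbit's REST API docs:

```python
import requests

# Hypothetical endpoint; substitute your workspace and deployment names.
URL = "https://your-workspace.modelbit.com/v1/example_deployment/latest"

rows = [[1, "first input"], [2, "second input"], [3, "third input"]]

# One in-flight request carrying three inferences, instead of three
# in-flight requests carrying one inference each.
response = requests.post(URL, json={"data": rows})
response.raise_for_status()

# Results come back keyed by the same ids, e.g.
# {"data": [[1, ...], [2, ...], [3, ...]]}
print(response.json())
```

Since only the REST request itself counts toward the concurrency limit, batching by a factor of N cuts your in-flight request count by roughly the same factor.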

RateLimitExceeded: Concurrency request size limit reached

If you get a Concurrency request size limit reached error, it means you've reached the limit on the total bytes of in-flight REST requests. This limit is calculated from the number of bytes in the POST bodies of all in-flight requests. If you're seeing this error, consider sending smaller request bodies. For example, if you're encoding large files in the API request, you could instead send links to the files and download them within the deployment.
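
A minimal sketch of that pattern, with a hypothetical endpoint and deployment function: the client sends only a URL, and the deployment fetches the file itself, so the large payload never counts against the in-flight byte limit.

```python
import requests

# Hypothetical endpoint; substitute your workspace and deployment names.
URL = "https://your-workspace.modelbit.com/v1/example_deployment/latest"

# Heavy: base64-encoding the file puts all of its bytes in the POST body,
# where they count against the in-flight size limit.
# with open("image.png", "rb") as f:
#     requests.post(URL, json={"data": base64.b64encode(f.read()).decode()})

# Light: send only a link to the file.
requests.post(URL, json={"data": "https://example.com/files/image.png"})

# Inside the deployment (hypothetical handler), download the file on demand:
def example_deployment(file_url: str):
    file_bytes = requests.get(file_url).content
    ...  # run inference on file_bytes
```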