1. Batching
Welcome back!
We're now going to dive deeper into rate limit errors: handling them will help us maximize our requests to the API while minimizing delays and avoiding failed responses. Let's get started!
2. What are rate limits
Rate limits in an API are like traffic regulations on a busy highway. Just as traffic lights, speed limits, and lane restrictions ensure smooth and safe traffic flow, rate limits regulate the flow of data between users and the API. By preventing individual users from making excessive requests, rate limits help block malicious attacks and ensure a balanced distribution of capacity among users within an organization.
3. How rate limits occur
A rate limit error can occur for one of two reasons:
either the number of requests in a given timespan is too high, meaning too many requests are being sent,
or the number of tokens in the requests exceeds a certain limit, meaning too much text has been included in the request.
4. Avoiding rate limits
Some solutions to avoid hitting these limits include
performing a short pause between requests, or setting the function to retry when a limit is hit.
If the frequency of requests is hitting the rate limit, multiple requests can be sent in batches at more staggered time intervals: this is referred to as 'batching'.
If the token limit is hit, the number of tokens in the request can be measured and reduced accordingly.
5. Retrying
When sending requests that might exceed the rate limit due to their high frequency, we can set our function to automatically retry if the limit is hit.
One way to approach this is by adding a retry decorator using Python's Tenacity library.
A decorator is a way to modify a function's behavior without changing its inner code, and the retry decorator controls how and when the function should be run again if it fails.
6. Retrying
The wait parameter can be configured through the wait_random_exponential() function.
Exponential backoff automatically retries requests with gradually increasing delays, from a minimum value, in this case 1 second, to a maximum value, in this case 60 seconds. The stop parameter can be set with the stop_after_attempt() function, specifying the maximum number of attempts.
To use the decorator, we wrap our request in a function, such as the get_response() function in this example, which returns the message.
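Putting these pieces together, here is a minimal sketch of what that could look like, assuming the openai Python client (v1 style); the model name and prompt are placeholders, and the retry settings mirror the values described above.

```python
from openai import OpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt

client = OpenAI()

# Retry with exponential backoff: wait between 1 and 60 seconds
# between attempts, and stop after a maximum of 4 attempts.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(4))
def get_response(model, message):
    response = client.chat.completions.create(model=model, messages=[message])
    return response.choices[0].message.content

# Example call (model name and prompt are placeholders).
print(get_response("gpt-4o-mini",
                   {"role": "user", "content": "What is a rate limit?"}))
```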
7. Batching
If the rate limit is due to the timing of the requests and not the number of tokens, one way to avoid it is to send the requests in batches.
In this example, we are asking the API to return the capital city of each of the countries provided,
by passing a list of message dictionaries to the API. Notice that we have specified precise instructions in the system message in order to get a complete answer containing all three responses.
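As a rough sketch of what this could look like (again assuming the openai v1 client and a placeholder model name), the countries are combined into a single request:

```python
from openai import OpenAI

client = OpenAI()

countries = ["France", "Japan", "Brazil"]

# One request covering all three countries, instead of one request per country.
messages = [
    {"role": "system",
     "content": "You will be given a list of countries. "
                "Reply with the capital city of each country, one per line."},
    {"role": "user", "content": ", ".join(countries)}
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```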
8. Batching
This is a much more efficient approach than looping over the Chat Completions endpoint and passing one country per iteration, as contrasted below.
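For comparison, a looped version (same assumptions as the sketch above) would send one request per country, hitting the endpoint three times instead of once:

```python
from openai import OpenAI

client = OpenAI()

for country in ["France", "Japan", "Brazil"]:
    # One request per country: three separate calls instead of a single batched one.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"What is the capital of {country}?"}]
    )
    print(response.choices[0].message.content)
```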
9. Reducing tokens
Another way to avoid rate limit errors, when the frequency of requests is not the issue, is to reduce the number of tokens. Tokens can be thought of as chunks of text: the 'units' into which words are broken down.
One way to measure tokens in Python is to use the tiktoken library:
this way we can first create the encoding using the 'encoding_for_model' function, selecting the model we are using, and then count the tokens in the prompt,
such as the sentence in this example,
using 'encode' and obtaining the total with the 'len()' function. Each OpenAI model has a different limit on the number of input tokens it can handle, so counting tokens is also a way to check that the prompt stays below that limit.
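A minimal sketch of that token count, assuming the tiktoken library is installed and recognizes the model name used here (a placeholder), with an example sentence standing in for the prompt:

```python
import tiktoken

# Example prompt whose length in tokens we want to measure.
prompt = "Tokens can be thought of as chunks of text: the units a model processes."

# Create the encoding for the model we are using, then count the tokens in the prompt.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
num_tokens = len(encoding.encode(prompt))
print(num_tokens)
```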
10. Let's practice!
And now let's have a look at some use cases to apply the techniques we have just explored!