
How to be gentle and slow down your requests

1. How to be gentle and slow down your requests

Besides identifying yourself with a custom user agent or other HTTP headers, another thing you can do is throttle your requests. This greatly reduces the load on the scraped website.
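
As a quick recap, identifying yourself might look like this minimal sketch with httr; the endpoint and the contact string in the user agent are placeholders of my choosing:

```r
library(httr)

# Identify yourself to the website with a custom user agent
# (the contact address here is a placeholder)
response <- GET(
  "https://httpbin.org/anything",
  user_agent("my-scraper; contact: me@example.com")
)
print(status_code(response))
```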

2. Don't try this at home!

Throttling becomes relevant when you are scraping many pages in succession, for example, the whole edit history of a Wikipedia page if you want to compare edits made over time. In this toy example, I wrote an infinite loop of requests to httpbin.org. Without throttling, the requests are emitted as fast as possible, meaning that the next request fires as soon as the previous one has returned a response from the server. Within a couple of minutes, hundreds if not thousands of requests left my computer.
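
The loop from this example might look roughly like the following sketch; the exact httpbin.org endpoint is my assumption:

```r
library(httr)

# Don't try this at home: an unthrottled infinite loop of requests.
# Each new request fires as soon as the previous response arrives.
while (TRUE) {
  response <- GET("https://httpbin.org/anything")
  print(status_code(response))
}
```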

3. A nicer way of requesting data from websites

Most websites, especially popular ones, will barely feel a tickle and have no problem handling such a volume of requests. However, it is still good practice to apply some sort of "cool down" between requests, for example, one second. Now, how to do this? There are several approaches. In this lesson, we will focus on the slowly() function from purrr, a very helpful package available in the Tidyverse.
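
For comparison, the most basic approach would be to pause manually with base R's Sys.sleep() inside the loop, as in this sketch; slowly() packages the same idea more cleanly, as the next step shows:

```r
library(httr)

# Naive cool-down: pause for one second after each request
while (TRUE) {
  response <- GET("https://httpbin.org/anything")
  print(status_code(response))
  Sys.sleep(1)  # wait one second before the next request
}
```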

4. A tidy approach to throttling

With the slowly() function, you can generate a throttled version of any function. It's as if you created a copy of the function, but with a built-in time delay. The first argument you give to slowly() is a tilde sign followed by the function you want to throttle; in this case, the GET() call to httpbin.org is throttled. The second, named argument, rate, takes another function from purrr: rate_delay(). With it, you can define the time delay between requests in seconds; in this case, I'd like a delay of three seconds. Now, when I call the modified throttled_GET() function within a while-loop, it only executes once every three seconds, and not as fast as possible, as is usually the case.
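
Put together, the code from this step should look roughly like the sketch below; the endpoint is again a placeholder of mine:

```r
library(httr)
library(purrr)

# Create a throttled copy of GET() that waits three seconds between calls
throttled_GET <- slowly(
  ~ GET("https://httpbin.org/anything"),
  rate = rate_delay(3)
)

# Each iteration now fires at most once every three seconds
while (TRUE) {
  response <- throttled_GET()
  print(status_code(response))
}
```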

5. Query custom URLs in a throttled function

One caveat of the previous code is that throttled_GET() will only ever call httpbin.org, because I hard-coded that URL into it. If you want your throttled function to take any URL as an argument, you have to supply a dot as the argument within the slowly() call. Now, throttled_GET() can request a different URL in each call, for example from Wikipedia.
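
A sketch of this variant; the Wikipedia URL is just an example of mine:

```r
library(httr)
library(purrr)

# The dot turns the URL into an argument of the throttled function
throttled_GET <- slowly(
  ~ GET(.),
  rate = rate_delay(3)
)

response <- throttled_GET("https://en.wikipedia.org/wiki/R_(programming_language)")
print(status_code(response))
```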

6. Looping over a list of URLs

This becomes a necessity if you want to loop over a list of URLs, which is usually the case when you're crawling data from a range of websites. Here, I defined a list of custom URLs at httpbin.org and used a for-loop to request each of them in succession. Within the loop body, I pass the url variable, which holds the current URL, to my throttled_GET() function. With this technique, you could crawl a list of Wikipedia URLs, for example.
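
The loop might look like the following sketch, reusing throttled_GET() from the previous step; the specific httpbin.org paths are placeholders:

```r
# A list of custom URLs (placeholder paths)
urls <- c(
  "https://httpbin.org/anything/1",
  "https://httpbin.org/anything/2",
  "https://httpbin.org/anything/3"
)

# Request each URL in succession, waiting three seconds between calls
for (url in urls) {
  response <- throttled_GET(url)
  print(status_code(response))
}
```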

7. Let's apply this to a real-world example!

And that's just what you will do in the following exercises. Let's go!