1. Telling who you are with custom user agents
So far, you might have asked yourself why you need to know about something as low-level as HTTP when packages like rvest do the job for you.
One big advantage of working directly with the httr package is that you can customize your requests. In this lesson, you'll explore how.
2. Show yourself!
If you send a letter like in the good old days, you usually write down your name (and maybe your address), either on the envelope or at the end of your letter/postcard. You can actually do the same thing when sending requests. Of course, the receiving web server already registers your IP address, but a better way is to explicitly tell the web server your name, perhaps an e-mail address, and the purpose of the request.
It's not something you'd do when normally surfing the web, of course. But when scraping a page intensively, it is actually good practice. If the owners of the web server notice an unusual spike in traffic, it might be helpful for them to know who they can contact. Who knows, maybe they'll send you the data in a well-structured form or provide you with an API, so you don't have to scrape their page anymore.
Probably the best way to identify yourself is through HTTP request headers. You already saw some headers like the Host header, which names the server the request is addressed to. There's also the User-Agent header, which usually tells the server on the other end which device and browser you are using. Web servers often use this information for site analytics, such as finding out the share of visitors using a certain browser. Since you are in full control of your requests, you can alter the User-Agent header and turn it into an identification field.
3. Modify headers with httr
There are two ways you can do that with httr. Either you specify it directly with the user_agent() function, whose result you pass along with the request, for example to GET(). Note that this only modifies the User-Agent header of that one request.
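Here is a minimal sketch of the per-request approach, assuming httr is installed. The agent string, the e-mail address, and the helper name get_identified() are made up for illustration and are not part of the course material.

```r
library(httr)

# Hypothetical identification string: name, contact, and purpose
my_agent <- "Jane Doe (jane.doe@example.com), collecting data for a course project"

# user_agent() only builds a request option; nothing is sent yet
ua <- user_agent(my_agent)

# Hypothetical helper: the custom User-Agent applies to this GET() call only
get_identified <- function(url) {
  GET(url, ua)
}
```

Calling get_identified("https://httpbin.org/user-agent") should echo the string back, assuming that public test service is reachable. Any other GET() call keeps httr's default User-Agent.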
Or, if you want to apply a certain User-Agent header to each and every request you make, for example within a script, you can use the set_config() function. To set the User-Agent header globally, pass it a call to the add_headers() function, which takes any number of named arguments: each argument's name is the name of a header, and its value is that header's value. Note that you need to put User-Agent within backticks because it contains a dash, which R would otherwise read as a minus sign. Once you do this, all subsequent requests will carry this header, and any User-Agent header that would otherwise have been sent is overridden.
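The global variant could look roughly like this, again assuming httr is installed and using a made-up identification string. The variable name identity_header is only for illustration.

```r
library(httr)

# Hypothetical identification string: name, contact, and purpose
my_agent <- "Jane Doe (jane.doe@example.com), collecting data for a course project"

# Backticks are required because the header name contains a dash
identity_header <- add_headers(`User-Agent` = my_agent)

# Register the header globally: later requests in this session will include it
set_config(identity_header)
```

After this, a plain GET() call without extra arguments sends the custom header as well. To return to httr's defaults later in the session, call reset_config().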
4. Let's try this!
Okay, let's try this out!