1. The nature of HTTP requests
In this last chapter of the course, we'll look a bit behind the curtain and see what's at the foundation of scraping: so-called HTTP requests.
2. Hypertext Transfer Protocol (HTTP)
HTTP stands for Hypertext Transfer Protocol and is a relatively simple set of rules that dictate how modern web browsers, or clients, communicate with a web server.
As shown in this image from the Mozilla Developer Network, a web document or website containing multiple assets like text, images, and videos fetches all these resources via so-called GET requests from one or more servers. We'll get to those GET requests in a minute.
3. The anatomy of requests
How does it work? As mentioned, a request is usually very simple. It's often composed only of a so-called method, in this case GET, a protocol version, and several so-called headers. The most important header is probably Host: the address of the server that holds the resource to be fetched.
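If you want to see this anatomy for yourself from R, a minimal sketch with httr could look like the following. The URL example.com is just a stand-in; the verbose() config echoes the raw exchange to the console.

    library(httr)

    # verbose() prints the outgoing request (method, protocol version, and
    # headers such as Host) as well as the incoming response headers
    resp <- GET("https://example.com", verbose())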
In turn, the response from the web server tells the client whether the request was successful, which is denoted by the status code and status message. The headers also tell the client, your browser, how to deal with the response. One helpful piece of information is the Content-Type header, which tells the browser which format of content was returned. In this case, it's simple HTML text that can now be rendered in the browser.
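In httr, these pieces of the response can be inspected individually. Here's a small sketch, again with example.com as a placeholder URL:

    library(httr)

    resp <- GET("https://example.com")
    status_code(resp)                    # e.g. 200
    http_status(resp)$message            # status code and status message combined
    headers(resp)[["content-type"]]      # e.g. "text/html; charset=UTF-8"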
Here are some typical status codes: 200 stands for "everything went well", while 404 indicates that the resource was not found on the web server. Codes in the 300 range are so-called redirects, telling you to fetch the resource at a different address. Lastly, codes in the 500 range usually indicate an error on the server, for instance when a program crashed while handling the request.
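When scraping, you usually don't have to interpret these codes by hand. A sketch of how this might look with httr, using a hypothetical path for illustration:

    library(httr)

    resp <- GET("https://example.com/might-not-exist")  # hypothetical path
    http_error(resp)       # TRUE for status codes of 400 and above
    stop_for_status(resp)  # turns a 400- or 500-range status into an R error
    # GET() follows 300-range redirects automatically, so you rarely see them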
4. Request methods: GET and POST
The most common request methods, or at least those that will become relevant when you scrape a page, are GET and POST. GET is always used when a resource, be it an HTML page or a mere image, is to be fetched without submitting any user data.
POST, on the other hand, is used when you need to submit some data to a web server. Most often, this is the result of a form filled out by the user. A POST request has a payload, which follows the headers. In this case, a couple of key-value pairs with data are submitted. These could be form fields that were filled with value1 and value2, respectively.
Of course, POST requests also result in a response from the server.
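With httr, such a POST request can be sketched as follows. The endpoint and field names are placeholders; encode = "form" sends the body as URL-encoded key-value pairs, the way a browser submits a simple form.

    library(httr)

    # hypothetical endpoint and form fields, for illustration only
    resp <- POST(
      "https://httpbin.org/post",
      body   = list(field1 = "value1", field2 = "value2"),
      encode = "form"
    )
    status_code(resp)  # the server answers a POST with a response, too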
5. HTTP requests with httr
With the httr library, you can easily create and send HTTP requests from your R session. With the aptly named GET() function, for example, you can send a GET request. When you do so in the console, the response from the server is printed as soon as it is received.
Here, the request was successful, as you can see from the 200 status code. After the response headers, the actual content of the website is listed. In this case, it's HTML text.
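A console session of the kind described here might look roughly like this, with example.com as a stand-in URL:

    library(httr)

    resp <- GET("https://example.com")
    resp  # printing the response shows the status, the headers, and the content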
6. HTTP requests with httr
If you want, you can extract the actual content from the response and parse it directly into an HTML document that rvest can work with. It's really straightforward: just use the content() function on the response from the GET() function.
As you can see, the already familiar HTML document results.
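A minimal sketch of that workflow, again with a placeholder URL; content() parses the HTML body into an xml_document that rvest functions accept directly:

    library(httr)
    library(rvest)

    resp <- GET("https://example.com")
    page <- content(resp)                      # parsed HTML document
    html_text(html_element(page, "title"))     # e.g. extract the page title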
7. Let's practice!
Okay, let's fire some HTTP requests!