Exercise

A content extractor

In the previous exercises, you established that every element of the urls vector you were given returns a 200 status code. Now that you know these pages are accessible, you will go a step further into web scraping by extracting some of their content.

To do this, we'll use functions from the rvest package, prefilling their arguments with partial(). The functions we will write in this exercise extract all the H2 HTML nodes from a page; on a webpage, these H2 nodes correspond to the level 2 headers. Once these headers have been extracted, html_text() will be used to pull the text content out of the raw HTML nodes.
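
As a rough illustration of the prefilling idea, the sketch below builds an H2 extractor with partial() and applies html_text() to its result. The inline HTML string and the names page, get_h2, h2_nodes and h2_text are assumptions made up for this example; they stand in for a real page from the urls vector, whose contents are not shown here.

```r
library(purrr)
library(rvest)

# Hypothetical inline page, used here instead of a real URL
page <- read_html(
  "<html><body><h1>Title</h1><h2>First</h2><p>Text</p><h2>Second</h2></body></html>"
)

# Prefill the css argument of html_nodes(); the result is a new function
# that only needs the parsed document
get_h2 <- partial(html_nodes, css = "h2")

# Extract the H2 nodes, then keep only their text content
h2_nodes <- get_h2(page)
h2_text <- html_text(h2_nodes)
h2_text
```

Here get_h2(page) should behave like html_nodes(page, css = "h2"), so the selector does not have to be repeated on every call.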

purrr and rvest have been loaded for you, and the urls vector is available in your workspace.

Instructions
  • Start by prefilling html_nodes() with css = "h2".

  • Combine this newly created function with read_html() and html_text(), placing it between the two, to create a text extractor for H2 headers.

  • Run this function on the urls vector, and name the result.

  • Print the result to see what it looks like.
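
The steps above could be sketched as follows. This is one possible approach under stated assumptions, not the exercise's reference solution: pages is a hypothetical stand-in for the urls vector (literal HTML strings instead of real addresses, so the example runs offline), and the names get_h2, get_h2_text, res, page_a and page_b are made up for illustration. "Name the result" is read here as naming the elements of the returned list.

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for the urls vector: literal HTML strings
pages <- c(
  "<html><body><h2>Intro</h2><h2>Usage</h2></body></html>",
  "<html><body><h2>News</h2></body></html>"
)

# Step 1: prefill html_nodes() with the "h2" selector
get_h2 <- partial(html_nodes, css = "h2")

# Step 2: compose() applies its functions from right to left by default,
# so read_html() runs first, then get_h2(), then html_text()
get_h2_text <- compose(html_text, get_h2, read_html)

# Step 3: run the extractor on every element and name the result
res <- map(pages, get_h2_text)
res <- set_names(res, c("page_a", "page_b"))

# Step 4: print the result
res
```

With the real urls vector, the same pattern would presumably be map(urls, get_h2_text), with the URLs themselves being a natural choice for the element names.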