A content extractor
In the previous exercises, you established that all the elements of the urls vector you were given return a 200 status code. Now that you know they are accessible, you will dig deeper into web scraping by doing some content extraction.
To do this, we'll use functions from the rvest package, which will be prefilled with partial(). The functions we will write in this exercise will extract all the H2 HTML nodes from a page; on a webpage, these H2 nodes correspond to the level 2 headers. Once we have extracted these titles, the html_text() function will be used to extract the text content from the raw HTML.
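Before applying these tools to web pages, it may help to see what partial() and compose() do on their own. The sketch below uses toy base-R functions instead of rvest, so it runs without network access; it assumes only that purrr is installed.

```r
library(purrr)

# partial() returns a new function with some arguments pre-set.
# Here we prefill the `digits` argument of round():
round2 <- partial(round, digits = 2)
round2(3.14159)  # 3.14

# compose() chains functions right-to-left by default,
# so the rightmost function runs first:
add_then_double <- compose(function(x) x * 2, function(x) x + 1)
add_then_double(3)  # (3 + 1) * 2 = 8
```

This right-to-left order is why, in the exercise, read_html() will sit at the end of the compose() call: it has to run first on the raw URL.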
purrr and rvest have been loaded for you, and the urls vector is available in your workspace.
This exercise is part of the course Intermediate Functional Programming with purrr.
Exercise instructions
- Start by prefilling html_nodes() with css = "h2".
- Combine this newly created function between read_html() and html_text(), to create a text extractor for H2 headers.
- Run this function on the urls vector, and name the result.
- Print the result to see what it looks like.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Prefill html_nodes() with the css param set to h2
get_h2 <- ___(html_nodes, ___)
# Combine the html_text, get_h2 and read_html functions
get_content <- ___(___, ___, ___)
# Map get_content to the urls list
res <- ___(urls, ___) %>%
set_names(___)
# Print the results to the console
___
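One possible way to fill in the blanks is sketched below. This is not the only valid answer; it assumes purrr and rvest are loaded and that urls is a character vector of reachable pages, so it needs network access to actually run.

```r
library(purrr)
library(rvest)

# Prefill html_nodes() with the css param set to "h2"
get_h2 <- partial(html_nodes, css = "h2")

# compose() applies functions right-to-left:
# read_html() first, then get_h2(), then html_text()
get_content <- compose(html_text, get_h2, read_html)

# Map get_content over the urls vector, keeping the URLs as names
res <- map(urls, get_content) %>%
  set_names(urls)

# Print the results to the console
res
```

Because compose() works right-to-left, read_html goes last in the call even though it runs first; reversing the order would try to parse HTML out of plain text and fail.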