Navigating HTML

1. Navigating HTML

Now it's time to learn how to exploit the hierarchical nature of HTML for scraping. Let us quickly go over the basic rules of that hierarchy.

2. HTML is like a tree

You might know the tree structure from other areas of computer science. If not, that's no problem. The tree data structure is like an actual tree turned upside down. There's always a root node. In this case, it's the html tag. The root has branches that lead to other nodes, or in the language of HTML, to children. In the example here, the html tag only has one child, the body tag. From now on, I'll use the terms "element", "node", and "tag" interchangeably – for our purposes, they all mean the same thing.

3. HTML is like a tree

In contrast, the two div tags, which designate general-purpose sections of a web page, are both children of the body tag. At the same time, they are siblings. The first div contains a paragraph, which doesn't have another HTML tag as a child. Instead, it only contains plain text. Here, we speak of a text node, or in tree terms, of a leaf.

4. HTML is like a tree

The second div directly contains text, but there's also an a tag within it. Technically, the text node and the a tag are siblings too. However, the text is a leaf, while the a tag is not a leaf – as it still contains text of its own. For demonstration purposes later in this video, I also included a paragraph at the end that is not enclosed by a div.
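To make the tree concrete, here is a minimal sketch that rebuilds a document with the shape described above and inspects it. The markup, the text inside it, and the example.com URL are assumptions made up for illustration, not the exact page from the slides. It uses the rvest package, which the next part of this video introduces, plus two helpers from the xml2 package that rvest builds on.

```r
library(rvest)

# Hypothetical markup with the same shape as the tree in this video:
# html > body > (div > p), (div > text + a), p
html <- read_html(
  '<html><body>
     <div><p>Some text in a paragraph.</p></div>
     <div>Text with <a href="https://example.com">a link</a></div>
     <p>A paragraph outside any div.</p>
   </body></html>'
)

body <- html_element(html, "body")

# The element children of body: two div siblings and the trailing paragraph
body_children <- html_name(html_children(body))

# Inside the second div, the text node and the a tag are siblings.
# xml_contents() also returns text nodes, which html_children() skips.
second_div <- html_elements(html, "div")[[2]]
second_div_types <- xml2::xml_type(xml2::xml_contents(second_div))
```

Here `body_children` lists the tag names of the three children of body, and `second_div_types` shows that the second div holds a text node followed by an element.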

5. Navigating the tree with rvest

With rvest, it's easy to quickly traverse the tree to select the nodes you're interested in. One simple function is html_children(). It takes an HTML document or a so-called node set as input and returns its children, or more specifically, an xml_nodeset. The same call can of course also be written with the pipe, in the familiar Tidyverse notation. Now you can extract the text of all these children with the html_text() function. In this case, the HTML document only has one child – the body tag – but html_text() extracts all the text that is inside that child, even though it's further down the tree.
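As a rough sketch of what this looks like in code, the following builds a small document and calls html_children() both directly and with the pipe. The markup and the example.com URL are assumptions for illustration only.

```r
library(rvest)

# Hypothetical markup, not the exact page from the slides
html <- read_html(
  '<html><body>
     <div><p>Some text in a paragraph.</p></div>
     <div>Text with <a href="https://example.com">a link</a></div>
     <p>A paragraph outside any div.</p>
   </body></html>'
)

# Direct call: the children of the document's root, i.e. the body tag
children <- html_children(html)

# The same call in pipe notation
children_piped <- html %>% html_children()

# html_text() collects all text below each child, however deep it sits
children_text <- html_text(children)
```

Because the root only has the body tag as a child, `children` has length one, and `children_text` is a single string containing the text of every node below body.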

6. Navigating to nodes with selectors

While html_children() is a nice shortcut, the function you're probably going to use the most when working with rvest is html_elements(). html_element(), without the plural, is a special case that only returns the first node that matches your selection. html_elements() takes not only an HTML document but also a so-called selector. This is a string that specifies a path through the HTML tree. For example, you could select the text of specific tags in this tree by specifying the name of the respective tags. The selector also adheres to a specific syntax, for example, the descendant syntax. To select only the text of paragraphs that sit inside a div, whether as direct children or further down the tree, you write "div" followed by "p", with a space in between.
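A minimal sketch of the descendant selector and of the difference between html_elements() and html_element(), again on made-up markup that only mimics the tree from this video:

```r
library(rvest)

# Hypothetical markup, not the exact page from the slides
html <- read_html(
  '<html><body>
     <div><p>Some text in a paragraph.</p></div>
     <div>Text with <a href="https://example.com">a link</a></div>
     <p>A paragraph outside any div.</p>
   </body></html>'
)

# Descendant selector: every p located anywhere inside a div
div_paragraphs <- html %>%
  html_elements("div p") %>%
  html_text()

# html_element() returns only the first match in document order
first_paragraph <- html %>%
  html_element("p") %>%
  html_text()
```

With this markup, `div_paragraphs` holds just the paragraph inside the first div; the trailing paragraph is skipped because it has no div ancestor.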

7. Navigating to nodes with selectors

Note that writing only "p" would select all the paragraphs in the whole HTML document. By the way, there's usually more than one way to reach your desired nodes. In this case, for instance, you could just as well select all the div nodes with a first html_elements() call. This returns a subset of the tree. Then you could append another html_elements() call that just selects the p elements within this subtree. And so forth.
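The two routes can be compared side by side. This is a sketch on hypothetical markup, assuming the same tree shape as before:

```r
library(rvest)

# Hypothetical markup, not the exact page from the slides
html <- read_html(
  '<html><body>
     <div><p>Some text in a paragraph.</p></div>
     <div>Text with <a href="https://example.com">a link</a></div>
     <p>A paragraph outside any div.</p>
   </body></html>'
)

# "p" alone matches every paragraph in the document
all_paragraphs <- html %>% html_elements("p")

# Route 1: a single descendant selector
one_step <- html %>% html_elements("div p")

# Route 2: select the divs first, then the paragraphs within them
two_steps <- html %>%
  html_elements("div") %>%
  html_elements("p")
```

Both routes end up at the same single paragraph, while `all_paragraphs` also includes the one outside the divs. Chaining is handy when you want to reuse the intermediate node set, for example to pull several different things out of each div.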

8. Extracting attributes

Another helpful function from the rvest package is html_attr(). With it, you can extract a single attribute from an HTML element, for example the href attribute from a link. The plural form html_attrs() returns all attributes of an element as a named vector, or a list of such vectors if you pass it a node set.
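Here is a small sketch of both functions. The link, its URL, and its title attribute are assumptions invented for this example.

```r
library(rvest)

# Hypothetical markup with a link that carries two attributes
html <- read_html(
  '<html><body>
     <div>Text with <a href="https://example.com" title="Example">a link</a></div>
   </body></html>'
)

# Select the single link as one node
link <- html %>% html_element("a")

# One attribute by name
link_href <- html_attr(link, "href")

# All attributes of the node as a named character vector
link_attrs <- html_attrs(link)
```

`link_href` is the URL as a plain string, and `link_attrs` has one entry per attribute, named after the attribute. If a requested attribute is missing, html_attr() returns NA by default.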

9. Let's do this!

Okay, let's scrape some specific content from a page – only by navigating the HTML tree.
