1. Introduction to XPATH
Welcome to the third chapter!
In this video, you'll be introduced to XPATH, another and even more powerful way to select nodes from an HTML tree.
2. XML Path Language
XPATH stands for XML Path Language. With this language, a so-called path through an HTML tree can be formulated, which is a slightly different approach than the one with CSS selectors. In this example, all p elements with class "blue" that are direct children of div elements would be selected.
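The example from the slide isn't part of this transcript, so here is a minimal sketch of what such a path could look like. It assumes the rvest package and a small made-up page; the page content and variable names are hypothetical.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div>
      <p class="blue">Blue paragraph inside a div</p>
      <p>Plain paragraph inside a div</p>
    </div>
    <p class="blue">Blue paragraph outside any div</p>
  </body>
</html>')

# p nodes with class "blue" that are direct children of a div
blue_xpath <- html %>%
  html_elements(xpath = '//div/p[@class = "blue"]') %>%
  html_text()

# A CSS selector that picks the same node on this page
blue_css <- html %>%
  html_elements('div > p.blue') %>%
  html_text()

blue_xpath
blue_css
```

Note that the predicate compares the whole class attribute with "blue", so a node with several classes would not match this path, while the CSS class selector would still match it.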
With XPATH, you can navigate not only down the HTML tree but also back up. It also lets you select nodes based on properties of other nodes.
This makes more advanced and customized selections of HTML nodes possible.
For example, you can select nodes whose children match certain properties. That's something which is very helpful, but not possible with CSS. This justifies learning XPATH in addition to CSS.
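To get a feeling for navigating back up the tree, here is a small sketch. It assumes rvest and a made-up page, and it uses two dots, the XPATH shorthand for "the parent of the current node"; the ids are hypothetical.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div id="first"><p class="blue">Blue</p></div>
    <div id="second"><p>Plain</p></div>
  </body>
</html>')

# Find the blue p node first, then step up to its parent with ".."
parent_ids <- html %>%
  html_elements(xpath = '//p[@class = "blue"]/..') %>%
  html_attr("id")

parent_ids
```

Only the div that contains the blue paragraph comes back, even though the path started at the p node.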
3. Selecting nodes
Probably the simplest use case is selecting nodes based on their type anywhere in the tree. For this, you type two forward slashes followed by the type of the nodes you want to select, for example //p for all p nodes. The equivalent CSS selector would be a "p". The double forward slash means something like "anywhere" in the tree.
A path that gives the same result is //body//p: p nodes anywhere below a body node that can itself be anywhere in the tree.
Yet another possible path would be a forward slash, followed by the html node, followed by another single forward slash, followed by the body node, and then the anywhere double forward slash and the p tag: /html/body//p. This basically means: Start at the root of the tree, go down one level to the html tag, go down one level to the body tag, and then search for p tags anywhere in the tree below the body tag. Note that you always need to explicitly name the "xpath" argument if you want to use XPATH with rvest, because an unnamed selector is treated as a CSS selector.
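The three paths described above can be compared side by side. This is a sketch that assumes rvest and a made-up page; the variable names are hypothetical.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <p>First paragraph</p>
    <div><p>Nested paragraph</p></div>
  </body>
</html>')

# Three paths that select the same p nodes on this page
by_type   <- html %>% html_elements(xpath = '//p')
via_body  <- html %>% html_elements(xpath = '//body//p')
from_root <- html %>% html_elements(xpath = '/html/body//p')

# Without naming the argument, the selector is interpreted as CSS
with_css <- html %>% html_elements('p')

html_text(by_type)
html_text(via_body)
html_text(from_root)
html_text(with_css)
```

All four calls return both paragraphs, including the one nested inside the div.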
4. Selecting specific nodes
Let's say you only want to select p tags that are direct children of div tags. First, you select div nodes anywhere in the tree with a double forward slash, then you specify the child relationship with a single forward slash. The resulting path is //div/p.
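Here is a sketch of the difference between the child and the "anywhere" relationship below a div. It assumes rvest and a made-up page; the blockquote is only there to put one paragraph a level deeper.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div>
      <p>Direct child of a div</p>
      <blockquote><p>Nested one level deeper</p></blockquote>
    </div>
    <p>Not inside a div</p>
  </body>
</html>')

# Single slash after div: only p nodes that are direct children
direct_children <- html %>%
  html_elements(xpath = '//div/p') %>%
  html_text()

# Double slash after div: p nodes at any depth below a div
any_depth <- html %>%
  html_elements(xpath = '//div//p') %>%
  html_text()

direct_children
any_depth
```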
5. Selecting specific nodes
Here's a powerful feature of XPATH: Selecting nodes based on their relations to other nodes. In this case, only the div nodes that have a nodes, that is, links, as direct children are selected.
For this, a so-called predicate in square brackets needs to be specified. More on that later. Note that there is no CSS selector equivalent: that's something that's just not possible in CSS.
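A sketch of such a predicate, assuming rvest and a made-up page with hypothetical ids. The a inside the square brackets is a condition on the div, so the div nodes themselves are returned, not the links.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div id="with-link"><a href="https://example.com">A link</a></div>
    <div id="without-link"><p>No link here</p></div>
    <div id="deep-link"><p><a href="https://example.com">Nested link</a></p></div>
  </body>
</html>')

# div nodes that have at least one a node as a direct child
div_ids <- html %>%
  html_elements(xpath = '//div[a]') %>%
  html_attr("id")

div_ids
```

The third div is left out because its link is a grandchild, not a direct child.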
6. Syntax: axes, steps, and predicates
Here's the general syntax of XPATH: Paths are mainly made of axes and steps. In the abbreviated notation used here, an axis is written as either one or two forward slashes and specifies a relationship between nodes. The single forward slash is the child relationship, while the double forward slash is the general descendant relationship.
Steps sit in between axes and name HTML types, such as span or a.
Predicates are specified within square brackets and declare conditions that must hold true for the type that precedes them.
This example can be read like so: Select all the span nodes in the tree, then go to their direct a-node-children, but only to those that have the class "external".
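The example itself isn't included in this transcript. A path that reads the same way could look like the following sketch, which assumes rvest and a made-up page with hypothetical links.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div>
      <span>
        <a class="external" href="https://example.com">External link</a>
        <a href="/about">Internal link</a>
      </span>
    </div>
    <p><a class="external" href="https://example.org">External link in a paragraph</a></p>
  </body>
</html>')

# axis //  step span  axis /  step a  predicate [@class = "external"]
external_links <- html %>%
  html_elements(xpath = '//span/a[@class = "external"]') %>%
  html_text()

external_links
```

The link in the paragraph has the right class but is not a child of a span, so it is not selected.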
In a similar fashion, one could go to all elements that have the "special" ID and then navigate to their div descendants anywhere below them. Note that, just as in CSS, the asterisk is the universal selector that applies to all HTML types.
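The second reading could be written with the asterisk and an id predicate, as in this sketch. It again assumes rvest and a made-up page; the element with the "special" ID happens to be a div here, but any type would match.

```r
library(rvest)

# Hypothetical page, made up for this sketch
html <- read_html('
<html>
  <body>
    <div id="special">
      <div>Direct child div</div>
      <blockquote><div>Nested div</div></blockquote>
    </div>
    <div>Div outside the special element</div>
  </body>
</html>')

# Any node with the id "special", then div nodes anywhere below it
special_divs <- html %>%
  html_elements(xpath = '//*[@id = "special"]//div') %>%
  html_text()

# A CSS selector that selects the same nodes on this page
special_divs_css <- html %>%
  html_elements('#special div') %>%
  html_text()

special_divs
special_divs_css
```

The element with the "special" ID is only the starting point: the path returns the div nodes below it, not that element itself.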
7. Let's practice!
Let's try out some XPATH queries!