Crash Course in XPath

1. Crash Course X

At this point we've run through the basics of HTML, using some wordy ways to describe how to navigate to particular elements. However, if we want to describe where these elements are within our programs (programs made to navigate and scrape HTML), then we need a standard, program-friendly syntax to do so. You may have noticed that all your exercises so far have been multiple choice; that is because we hadn't yet covered how to turn our wordy navigation of HTML into a string the computer can interpret. That changes now. This lesson gives a crash course in the basics of what's called XPath notation, one of two common choices for this purpose. In the next chapters, we will go deeper into both syntaxes with many more examples.

2. Another Slasher Video?

Jumping right in, a simple XPath string we could write in Python is given here. One nice property of XPath notation is that you may already be familiar with similar syntaxes: it uses a single forward slash much as you do when navigating directories or typing a URL into your browser. The single forward slash moves us forward one generation. In fact, if we think of the tag names as the "directory" names, then these simple XPaths look very much like paths between directories. What might seem unfamiliar are the brackets. The brackets help specify which element or elements we want to select. For example, there could be several div elements that are children of the body element (that is, several div siblings), so we can use the brackets to narrow in on the div element we want.
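The slide's actual string is not reproduced in this transcript. As a rough sketch, assume it is something like `/html/body/div[2]`, which is consistent with the description above. Splitting it on the single forward slash shows the directory-like steps, one generation per step:

```python
# Assumed example string; the original slide's XPath is not shown in the transcript.
xpath = "/html/body/div[2]"

# Each single forward slash moves forward one generation,
# much like moving one level deeper in a directory path.
steps = xpath.strip("/").split("/")
print(steps)  # ['html', 'body', 'div[2]']
```

The last step, `div[2]`, carries the bracket that narrows the selection among the div siblings.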

3. Another Slasher Video?

To illustrate the sample XPath string we wrote in the last slide, here we have highlighted the div element that would be selected within a tree representation of some HTML. Notice that the number 2 in the brackets of our XPath expression refers to the second of the three div elements (ordered from top to bottom as usual). Note also that the first child of the body is a span element, so it is not counted when looking at the div elements.
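To make the tree picture concrete, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The HTML snippet and the `id` values are made up for illustration. ElementTree supports only a subset of XPath and works relative to an element, so the assumed full expression `/html/body/div[2]` is written as `./body/div[2]` starting from the html root:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML: the body has one span followed by three divs.
html = """
<html>
  <body>
    <span>first child of body, not a div</span>
    <div id="a">first div</div>
    <div id="b">second div</div>
    <div id="c">third div</div>
  </body>
</html>
"""

root = ET.fromstring(html)  # root is the <html> element

# "./body/div[2]" from the html root stands in for "/html/body/div[2]".
# The index counts only div siblings, so the leading span is skipped.
selected = root.findall("./body/div[2]")
print([div.get("id") for div in selected])  # ['b']
```

ElementTree is an XML parser, so this only works because the sample markup is well formed; real-world HTML usually calls for a dedicated HTML parsing library.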

4. Slasher Double Feature?

Another important feature of XPath notation is the double forward slash. The double forward slash says to "look forward to all future generations" (instead of just one generation, like the single forward slash). So, for example, we could navigate to all table elements within an HTML document by simply typing double forward slash table. Or, we might want to restrict ourselves to a specific div element (say, the one we learned how to navigate to in the last couple of slides) and navigate to all table elements that are descendants of that div element.
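The two cases just described would read `//table` and, assuming the earlier example string, `/html/body/div[2]//table`. Below is a small sketch with the standard-library `xml.etree.ElementTree`, which needs the relative forms `.//table` and `./body/div[2]//table` from the html root. The markup and `id` values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tables at different depths under different divs.
html = """<html><body>
<span>intro</span>
<div><table id="t1"></table></div>
<div>
  <table id="t2"></table>
  <p><table id="t3"></table></p>
</div>
<div></div>
</body></html>"""

root = ET.fromstring(html)  # root is the <html> element

# Double slash: every table in any later generation of the document.
all_tables = root.findall(".//table")
print([t.get("id") for t in all_tables])  # ['t1', 't2', 't3']

# Restrict to the second div first, then look through all of its descendants.
div_tables = root.findall("./body/div[2]//table")
print([t.get("id") for t in div_tables])  # ['t2', 't3']
```

Note that `t3` is a grandchild of the second div, not a child, yet the double forward slash still reaches it.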

5. Ex(path)celent

And that's it for now! We have only just scratched the surface of XPath notation, but we've gone deep enough that you can begin to write some code and get your feet wet navigating HTML computationally. Let me emphasize here that XPath is general, meaning that it is not Python-specific. So, if you decide to start scraping the web in R, say, most libraries there will also be able to read and interpret your XPath strings.
