Extract nodes based on the number of their children
As shown in the video, the XPath count() function can be used within a predicate to narrow down a selection to those nodes that have a certain number of children. This is especially helpful if your scraper depends on some nodes having a minimum number of children.
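To illustrate the idea on something other than the exercise data, here is a minimal sketch that uses count() inside a predicate. The page content and the names page and long_lists are made up for this illustration; it assumes rvest 1.0.0 or later, which provides minimal_html() and html_elements().

```r
library(rvest)

# Hypothetical page: three lists with different numbers of items
page <- minimal_html('
  <ul id="a"><li>one</li></ul>
  <ul id="b"><li>one</li><li>two</li></ul>
  <ul id="c"><li>one</li><li>two</li><li>three</li></ul>
')

# The predicate [count(li) >= 2] keeps only those <ul> nodes
# that have at least two <li> children
long_lists <- page %>%
  html_elements(xpath = '//ul[count(li) >= 2]')

html_attr(long_lists, "id")
#> [1] "b" "c"
```

Inside the predicate, count(li) is evaluated relative to each candidate ul, so it counts only that node's direct li children.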
Here's an excerpt from a page (without any classes or IDs…) that you might be scraping:
...
<div>
<h1>Tomorrow</h1>
</div>
<div>
<h2>Berlin</h2>
<p>Temperature: 20°C</p>
<p>Humidity: 50%</p>
</div>
<div>
<h2>London</h2>
<p>Temperature: 15°C</p>
</div>
<div>
<h2>Zurich</h2>
<p>Temperature: 22°C</p>
<p>Humidity: 60%</p>
</div>
...
You're only interested in divs that have exactly one h2 header and at least two paragraphs, because your application can't really deal with incomplete weather forecasts.
The above HTML is available to you via forecast_html.
This exercise is part of the course Web Scraping in R
Exercise instructions
- Select the desired divs with the appropriate XPath selector, making use of the count() function.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select only divs with one header and at least two paragraphs
forecast_html %>%
html_elements(xpath = '___')
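One possible way to fill in the blank is sketched below. It is not necessarily the intended solution. Because forecast_html is only provided inside the exercise environment, the sketch rebuilds it from the excerpt above with read_html(). That reconstruction and the name complete_forecasts are assumptions made for this illustration.

```r
library(rvest)

# Assumption: rebuild forecast_html from the excerpt shown in the exercise
forecast_html <- read_html('
<html><body>
  <div><h1>Tomorrow</h1></div>
  <div><h2>Berlin</h2><p>Temperature: 20°C</p><p>Humidity: 50%</p></div>
  <div><h2>London</h2><p>Temperature: 15°C</p></div>
  <div><h2>Zurich</h2><p>Temperature: 22°C</p><p>Humidity: 60%</p></div>
</body></html>')

# Select only divs with one header and at least two paragraphs:
# count(h2) = 1 requires exactly one h2 child,
# count(p) > 1 requires two or more p children
complete_forecasts <- forecast_html %>%
  html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')

# Show which cities survived the filter
complete_forecasts %>%
  html_element("h2") %>%
  html_text()
#> [1] "Berlin" "Zurich"
```

The "Tomorrow" div is dropped because it has no h2, and the London div is dropped because it has only one paragraph.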