
Scrape your first table

1. Scrape your first table

Now that you have a basic understanding of HTML, let us look at one of the most important elements when it comes to scraping: the table.

2. A simple table

In its most basic form, a table may consist of only three different HTML tags: table, tr, and td. The table tag designates a table, as the name says. The tr tag designates a row and is wrapped around multiple td tags, each of which designates a single cell. Normally, the number of td tags in each row should be identical. However, the colspan attribute of td allows a cell to span multiple columns.
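To make this structure concrete, here is a minimal sketch in R. The HTML snippet, including the names and numbers in it, is made up for illustration and is not taken from the slides.

```r
library(rvest)

# Hypothetical three-row table built only from table, tr, and td tags
html <- read_html('
<table>
  <tr><td>Mike</td><td>35</td></tr>
  <tr><td>Dilbert</td><td>28</td></tr>
  <tr><td colspan="2">No data yet</td></tr>
</table>')

# Each tr element is one row
rows <- html_elements(html, "tr")

# The first row holds two td cells
first_row_cells <- html_elements(rows[[1]], "td")

# The last row holds a single td cell that spans both columns
last_row_cells <- html_elements(rows[[3]], "td")
last_row_colspan <- html_attr(last_row_cells, "colspan")
```

The last row has fewer td tags than the others, yet it still fills the full width of the table because of its colspan attribute.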

3. A table with a header row

There are a lot more tags that can be used to better describe and structure a table, which also makes scraping easier. For example, the header cells that usually contain the column names can be specified explicitly with the th tag. Usually, browsers will render this row in bold. For scraping, a designated header row is an advantage: the header cells can be told apart from the data cells, so the actual data can be queried on its own.
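As a rough sketch of why the th tag helps, the header cells and the data cells can be selected separately. The table content below is again a made-up example.

```r
library(rvest)

# Hypothetical table with an explicit header row made of th tags
html <- read_html('
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Mike</td><td>35</td></tr>
  <tr><td>Dilbert</td><td>28</td></tr>
</table>')

# th selects only the header cells, so these are the column names
column_names <- html_text(html_elements(html, "th"))

# td selects only the data cells; the header row is left out
data_cells <- html_text(html_elements(html, "td"))
```

Because the header cells use a different tag, no extra logic is needed to skip the first row when collecting the data.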

4. Scraping a table with rvest

Apart from using functions like html_element() and html_text(), rvest provides a helper function called html_table(). The biggest advantage of this approach is that it converts the table data into a tibble on the fly. Of course, html_table() works best if the desired table has a clean structure and makes use of the more semantic tags like th. Note that the output of html_table() in this case is a list with one element, namely the only table in the HTML document. If there were more than one table in the document, the list would have more entries.
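Here is a minimal sketch of that call, assuming a recent rvest version and a made-up document that contains exactly one table.

```r
library(rvest)

# Hypothetical document with a single, cleanly structured table
html <- read_html('
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Mike</td><td>35</td></tr>
  <tr><td>Dilbert</td><td>28</td></tr>
</table>')

# Called on the whole document, html_table() returns a list of tables
tables <- html_table(html)

# The document has only one table, so it is the first list element
people <- tables[[1]]
```

The th row ends up as the column names of people, and the two td rows become its data rows.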

5. Scraping a table with rvest

However, if the table does not have such a clean structure, html_table() has a couple of extra arguments that you can pass. For instance, with header = TRUE it can be explicitly told to regard the first row as the header row, even if it only consists of td tags.
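A short sketch of the difference, using a made-up table whose first row holds the column names in plain td tags:

```r
library(rvest)

# Hypothetical table without th tags: the column names sit in td cells
html <- read_html('
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Mike</td><td>35</td></tr>
  <tr><td>Dilbert</td><td>28</td></tr>
</table>')

table_node <- html_element(html, "table")

# By default, the first row is treated as ordinary data
without_header <- html_table(table_node)

# header = TRUE promotes the first row to column names
with_header <- html_table(table_node, header = TRUE)
```

Without the argument, the result has three data rows. With header = TRUE, it has two data rows and the columns are named Name and Age.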

6. Scraping a table with rvest

Note that html_table() automatically fills empty cells with NA values. Let's assume that the country in the second row is empty in the original HTML. As you can see, html_table() replaces it with an NA.
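The following sketch reproduces such a case with a made-up table. It assumes a recent rvest version in which html_table() has an na.strings argument. How an empty string is treated by default may depend on the rvest version and the column type, so the sketch passes na.strings explicitly to be on the safe side.

```r
library(rvest)

# Hypothetical table in which the country of the second data row is empty
html <- read_html('
<table>
  <tr><th>Name</th><th>Country</th></tr>
  <tr><td>Mike</td><td>Switzerland</td></tr>
  <tr><td>Dilbert</td><td></td></tr>
</table>')

table_node <- html_element(html, "table")

# Assumption: treat empty strings as missing values, in addition to "NA"
people <- html_table(table_node, na.strings = c("", "NA"))

# Check which countries are missing
country_missing <- is.na(people$Country)
```

Here, country_missing is FALSE for the first row and TRUE for the second one.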

7. Scraping "tables" in reality

Whenever a website developer uses a nicely structured table syntax to show tabular data, scraping it with R is pretty straightforward. However, in my work as a data journalist, I've come across a lot of examples like the one shown here: Instead of actual table elements, the developer uses more generic HTML tags like divs to render a table. The actual look and feel of the table are then specified with CSS, a complementary style definition language. These styles are referenced with the "class" attribute of the div tags here. In such cases, a scraper needs to make use of more advanced selectors. We will dive deeper into these in the remainder of this course. I hope you'll stay on board!
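To give a first impression, here is a rough sketch of how such a div-based "table" could be scraped by class. The class names table, row, header, and cell are hypothetical, as is the content. Real websites will use their own names.

```r
library(rvest)

# Hypothetical "table" built from divs; the class names are made up
html <- read_html('
<div class="table">
  <div class="row header"><div class="cell">Name</div><div class="cell">Age</div></div>
  <div class="row"><div class="cell">Mike</div><div class="cell">35</div></div>
  <div class="row"><div class="cell">Dilbert</div><div class="cell">28</div></div>
</div>')

# The number of columns is the number of cells in the header row
n_cols <- length(html_elements(html, ".header .cell"))

# Collect the text of all cells, row by row, in document order
cell_text <- html_text(html_elements(html, ".row .cell"))
cells <- matrix(cell_text, ncol = n_cols, byrow = TRUE)

# Drop the header row and build a data frame by hand
people <- data.frame(
  name = cells[-1, 1],
  age = as.integer(cells[-1, 2]),
  stringsAsFactors = FALSE
)
```

Since html_table() only looks for real table elements, the rows and columns have to be reassembled manually here, which is why class-based selectors become important.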

8. Let's practice!

For now, let's scrape some tables with rvest!