
The shortcomings of html_table() with badly structured tables

Sometimes, you only want to select text that's a direct descendant of a parent element. In the following example table, the name of each role is wrapped in an em tag, but its function, e.g. "Voice", sits in the same td element as plain text next to the em part. That structure makes the data awkward to query.

Here's an excerpt from the HTML code:

<table>
 <tr>
  <th>Actor</th>
  <th>Role</th>
 </tr>
 <tr>
  <td class = "actor">Jayden Carpenter</td>
  <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
 </tr>
 ...
</table>
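To make the problem concrete, here is a minimal sketch of how html_table() treats such a cell. It assumes rvest 1.0 or later; `demo_html` and `demo_roles` are hypothetical names, and the document is rebuilt from the excerpt above rather than taken from the course's roles_html.

```r
library(rvest)

# Hypothetical stand-in for roles_html, built from the excerpt above
demo_html <- minimal_html('
<table>
 <tr>
  <th>Actor</th>
  <th>Role</th>
 </tr>
 <tr>
  <td class = "actor">Jayden Carpenter</td>
  <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
 </tr>
</table>')

# html_table() works cell by cell, so the em text and the plain text
# that follows it end up in the same string of the Role column
demo_roles <- demo_html %>%
  html_element("table") %>%
  html_table()

demo_roles$Role
```

The Role column holds the role name and its function together in one value, so separating them afterwards would require string manipulation.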

In this exercise, you will try to scrape the table using a familiar rvest function. In doing so, you will see its limits.

The roles_html variable contains the document with the table.
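For contrast, the direct-descendant selection mentioned at the top can be sketched with XPath, where `text()` matches only the text nodes that are immediate children of an element. This is an illustration on a hypothetical one-row document (`demo_html`), not the exercise solution, and the variable names are assumptions.

```r
library(rvest)

# Hypothetical one-row document mirroring the excerpt's structure
demo_html <- minimal_html('
<table>
 <tr>
  <td class = "actor">Jayden Carpenter</td>
  <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
 </tr>
</table>')

# text() selects only text nodes directly under the td, skipping the em child
role_function <- demo_html %>%
  html_elements(xpath = '//td[@class = "role"]/text()') %>%
  html_text(trim = TRUE)

# The role name lives inside the em child and is selected separately
role_name <- demo_html %>%
  html_elements(xpath = '//td[@class = "role"]/em') %>%
  html_text()

role_function
role_name
```

Selecting the two parts separately keeps the name and the function in distinct vectors, which html_table() cannot do by itself.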

This exercise is part of the course

Web Scraping in R


Instructions

  • Try to extract a data frame from the table with a function you have learned in the first chapter.
  • Have a look at the resulting data frame.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Extract the data frame from the table using a known function from rvest
roles <- roles_html %>% 
  html_element(xpath = "//___") %>% 
  ___()
# Print the contents of the roles data frame
___