Select directly from a parent element with XPATH's text()

In this exercise, you'll work with the same table. This time, you'll extract the function information in parentheses into its own column, so you need to build a data frame with three columns instead of two: actors, roles, and functions.

To do this, you'll need to apply the XPATH function introduced in the video instead of html_table(), which often does not work in practice when the HTML table element is not well structured, as is the case here.

For your reference, here is an excerpt of the table HTML again:

<table>
 <tr>
  <th>Actor</th>
  <th>Role</th>
 </tr>
 <tr>
  <td class = 'actor'>Jayden Carpenter</td>
  <td class = 'role'><em>Mickey Mouse</em> (Voice)</td>
 </tr>
 ...
</table>

In this exercise, the roles_html variable contains the HTML document with its table element.
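
To illustrate how text() selects directly from a parent element, here is a minimal sketch. It assumes the rvest package is installed and uses a hypothetical example_html object, built from the excerpt above with minimal_html(), as a stand-in for roles_html. It is not the exercise solution, only one way the function information could be pulled out.

```r
library(rvest)

# Hypothetical stand-in for roles_html, rebuilt from the excerpt above
example_html <- minimal_html("
<table>
 <tr><th>Actor</th><th>Role</th></tr>
 <tr>
  <td class='actor'>Jayden Carpenter</td>
  <td class='role'><em>Mickey Mouse</em> (Voice)</td>
 </tr>
</table>")

# text() matches only the text nodes that are direct children of the td,
# so the text inside the nested <em> element is not included
functions <- example_html %>%
  html_elements(xpath = '//table//td[@class = "role"]/text()') %>%
  html_text(trim = TRUE)
functions
```

Because the role name sits inside the em element, it is not a direct text child of the cell, which is why text() leaves only the parenthesized part.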

This exercise is part of the course Web Scraping in R.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Extract the actors in the cells having class "actor"
actors <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors

# Extract the roles in the cells having class "role"
roles <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "___"]/___') %>% 
  ___()
roles