A content extractor

In the previous exercises, you established that every element of the urls vector you were given returns status code 200. Now that you know the pages are reachable, you'll go deeper into web scraping by extracting content.

To do this, we'll use functions from the rvest package, prefilled with partial(). The functions we write in this exercise retrieve all H2 HTML nodes from a page; on a web page, these H2 nodes correspond to level 2 headings. Once we have extracted these headings, we'll use html_text() to pull the text content out of the raw HTML.
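As a minimal sketch of the prefilling idea, the snippet below applies a prefilled html_nodes() to a small inline HTML document rather than a live URL. The `page` object and its contents are hypothetical and only there so the example runs offline.

```r
library(purrr)
library(rvest)

# Hypothetical inline document standing in for a downloaded page
page <- read_html(
  "<html><body><h1>Title</h1><h2>First</h2><p>text</p><h2>Second</h2></body></html>"
)

# Prefill the css argument, so get_h2() only needs the parsed document
get_h2 <- partial(html_nodes, css = "h2")

# Extract the H2 nodes, then keep only their text content
h2_text <- html_text(get_h2(page))
h2_text
```

Because css is fixed in advance, get_h2() takes a single argument, which is what makes it easy to chain with other one-argument functions.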

purrr and rvest have been loaded for you, and the urls vector is available in your workspace.

This exercise is part of the course

Intermediate Functional Programming with purrr

Exercise instructions

Start by prefilling html_nodes() with css = "h2".
Compose this newly created function between read_html and html_text to build a text extractor for H2 headings.
Run this function on the urls vector and set names on the result.
Print the result to see what it looks like.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Prefill html_nodes() with the css param set to h2
get_h2 <- ___(html_nodes, ___)

# Combine the html_text, get_h2 and read_html functions
get_content <- ___(___, ___, ___)

# Map get_content to the urls list
res <- ___(urls, ___) %>%
  set_names(___)

# Print the results to the console
___
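One possible way to fill in the steps above is sketched here. It assumes two literal HTML strings (the hypothetical `pages` and `labels` vectors) in place of the real urls vector, so that it runs without network access; read_html() accepts literal HTML as well as URLs. Treat it as an illustration of the pattern, not as the reference solution.

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for the urls vector: literal HTML instead of addresses
pages <- c(
  "<html><body><h2>Intro</h2><h2>Usage</h2></body></html>",
  "<html><body><h1>Top</h1><h2>Only one</h2></body></html>"
)
labels <- c("first", "second")

# Prefill html_nodes() with the css param set to h2
get_h2 <- partial(html_nodes, css = "h2")

# compose() applies functions right to left by default:
# read_html first, then get_h2, then html_text
get_content <- compose(html_text, get_h2, read_html)

# Map the extractor over every page and name the results
res <- map(pages, get_content) %>%
  set_names(labels)

# Print the results to the console
res
```

With the real urls vector, the same pipeline would pass urls to both map() and set_names(), so each element of the result is labeled with the address it came from.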
