Extract names with context
Let's return to our dataset about Swiss politicians. It consists of two variables: articles, a collection of news articles about Swiss politics, and politicians, a vector with the names of several Swiss politicians.
You have already counted how often each name occurs, but wouldn't it be more interesting to also see the context in which the names are used? You could, for example, compare whether the contexts differ between female and male politicians. To do so, you'll have to extract the text surrounding the politicians' names.
As the text contains word characters (\\w) as well as punctuation ([:punct:]) such as periods (.) and commas (,), you will have to create a pattern that matches both of these character types.
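To see what such a combined character class does, here is a minimal sketch on a made-up sentence. The sentence and the variable names are assumptions for illustration and are not part of the course data; the stringr package is assumed to be installed.

```r
library(stringr)

# Made-up sentence for illustration (not from the course dataset)
sentence <- "Yes, the council voted."

# [\\w[:punct:]]+ matches runs of word characters and punctuation,
# so a word keeps the comma or period attached to it
tokens <- str_extract_all(sentence, "[\\w[:punct:]]+")[[1]]
print(tokens)
```

With \\w alone, the comma and the period would be left out of the matches.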
This exercise is part of the course
Intermediate Regular Expressions in R
Exercise instructions
- Use the vector politicians and collapse it to create an "or pattern" like you did in chapter 2.
- Create a custom pattern in square brackets [] that matches both word characters and punctuation.
- Using glue, add the newly created context both in front of and after the polit_pattern. The \\s? indicates that there may or may not be a space after the politician names.
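The first and third steps can be tried in isolation. The sketch below uses two invented names as a stand-in for the politicians vector (an assumption, since the real vector is only available in the course environment) and shows how collapsing builds the alternation and how \\s? behaves.

```r
library(glue)
library(stringr)

# Invented names standing in for the course's politicians vector
politicians <- c("Anna Muster", "Peter Beispiel")

# Collapse the vector into a single "or pattern"
polit_pattern <- glue_collapse(politicians, sep = "|")
print(polit_pattern)

# The parentheses keep the alternation grouped; \\s? matches an optional space
name_pattern <- glue("({polit_pattern})\\s?")
with_space <- str_extract("Anna Muster spoke first", name_pattern)
without_space <- str_extract("Peter Beispiel, the speaker", name_pattern)
```

In the first sentence the match includes the space after the name; in the second the name is followed by a comma, so \\s? matches nothing and the name alone is returned.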
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create our polit_pattern again by collapsing "politicians"
polit_pattern <- glue_collapse(___, sep = "|")
# Match one or more word characters or punctuations
context <- "([___[___]]+\\s){0,10}"
# Add this pattern in front and after the polit_pattern
polit_pattern_with_context <- glue(
  "{___}({polit_pattern})\\s?{___}"
)
str_extract_all(
  articles$text,
  pattern = polit_pattern_with_context
)
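One possible way to fill in the blanks is sketched below on toy data. The names and article texts are invented for illustration, and the data frame only mimics the shape of the course's articles object; this is a sketch under those assumptions, not the official solution.

```r
library(glue)
library(stringr)

# Invented stand-ins for the course objects
politicians <- c("Anna Muster", "Peter Beispiel")
articles <- data.frame(
  text = c(
    "Yesterday in Bern, Anna Muster presented the budget.",
    "The energy plan, said Peter Beispiel, is on track."
  ),
  stringsAsFactors = FALSE
)

# Collapse the names into an "or pattern"
polit_pattern <- glue_collapse(politicians, sep = "|")

# Up to ten tokens of word characters or punctuation, each followed by whitespace
context <- "([\\w[:punct:]]+\\s){0,10}"

# Context before the name, the name itself, an optional space, context after
polit_pattern_with_context <- glue(
  "{context}({polit_pattern})\\s?{context}"
)

result <- str_extract_all(
  articles$text,
  pattern = polit_pattern_with_context
)
print(result)
```

Because every context token must be followed by whitespace, the last word of a text that ends without trailing whitespace (here "budget." and "track.") is not part of the match.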