1. Extracting matches and surroundings from a text
So far, our examples were based on the assumption that we have different lines of text that all have the same structure. But what if that's not the case? Regular expressions can help us in many other situations. What if, for example, we are looking for a list of names in a text and want to look at the context in which these names appear?
Well, that's what we are going to do in the last lesson of this chapter.
2. Mentions of a company name
Let's imagine we have a collection of blog posts and want to find out if and how people talked about our fictitious company "ABC Enterprises". How could we achieve this?
Well, we could say, for example, that we would like to extract our company name together with up to ten words before it and up to ten words after it. Using character classes, we can define the following pattern for one word: a word consists of one or more word characters followed by a space. So the pattern for one word looks like this.
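As a rough sketch of the single-word pattern just described, assuming the stringr package and a made-up sample sentence (the variable names `one_word` and `sample_text` are illustrative and not taken from the slides):

```r
library(stringr)

# One word: one or more word characters (\w+) followed by one whitespace character (\s).
# Backslashes are doubled because the pattern is written as an R string.
one_word <- "\\w+\\s"

# Hypothetical sample text, for illustration only
sample_text <- "We love ABC Enterprises"

# Every word that is followed by whitespace is matched, trailing space included.
# The last word has no trailing space, so it is not matched.
words <- str_extract_all(sample_text, one_word)[[1]]
print(words)
# [1] "We "   "love " "ABC "
```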
Using curly braces, we can define how many words we want to look for. We could, for example, match up to ten words by adding zero, comma, ten in curly braces, written {0,10}. This repetition is applied to the whole group, that is, to everything within the parentheses.
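The effect of the parentheses can be checked with a small experiment. This is only a sketch with a hypothetical sample string, again assuming stringr:

```r
library(stringr)

# Hypothetical sample text with thirteen words
sample_text <- "one two three four five six seven eight nine ten eleven twelve thirteen"

# With parentheses, {0,10} repeats the whole "word plus whitespace" group.
grouped <- str_extract(sample_text, "(\\w+\\s){0,10}")

# Without parentheses, {0,10} only applies to \s, so a single word is matched.
ungrouped <- str_extract(sample_text, "\\w+\\s{0,10}")

print(grouped)
# [1] "one two three four five six seven eight nine ten "
print(ungrouped)
# [1] "one "
```

The quantifier is greedy, so the grouped version takes as many words as it can, up to the limit of ten.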
If we now take this pattern and put it in front of and behind our company name, we have a pattern that matches our company name plus zero to ten words on each side. If we save the text in the variable "blog post" and search it with the "str_extract_all" function, we get the following result: the text is stripped down to the part of the blog post that contains "ABC Enterprises" plus some words on both sides of the name.
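The slide content is not part of this transcript, so the following is a minimal sketch under assumptions: the text in `blog_post` is invented, and the trailing group is written as whitespace followed by a word (`\\s\\w+`), since the space after the company name comes before each following word.

```r
library(stringr)

# Hypothetical blog post, for illustration only
blog_post <- "Last week I ordered a new laptop. I bought it from ABC Enterprises and the delivery was quick! I will order again."

# Up to ten words before and up to ten words after the company name
pattern <- "(\\w+\\s){0,10}ABC Enterprises(\\s\\w+){0,10}"

context <- str_extract_all(blog_post, pattern)[[1]]
print(context)
# [1] "I bought it from ABC Enterprises and the delivery was quick"
```

In this sample the match stops at "laptop." on the left and at "quick!" on the right, because neither the period nor the exclamation mark is a word character.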
3. Punctuation
You might wonder why it only extracts 4 words in front of and 7 words after the company name, when in fact the text had many more words on both sides. This is because we haven't included any punctuation in our pattern, so when a period or an exclamation mark is encountered, the pattern no longer matches and the result gets cut off there. In our case this is not a big problem, as we are interested in the sentence that contains the company name "ABC Enterprises".
But if we do not want that to happen, we could replace our word character class with a custom character class using square brackets. Inside the square brackets we could then put both our word character class and the named class "punct", written [:punct:].
The result will then look like this: You can see that now we extract ten words on both sides of the company name and also match periods and exclamation marks.
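A sketch of the punctuation-aware variant, reusing the same invented `blog_post` text and assuming stringr's default regex engine (ICU), which accepts `[:punct:]` inside a bracket expression:

```r
library(stringr)

# Hypothetical blog post, for illustration only
blog_post <- "Last week I ordered a new laptop. I bought it from ABC Enterprises and the delivery was quick! I will order again."

# A "word" is now one or more word characters or punctuation characters
pattern_punct <- "([\\w[:punct:]]+\\s){0,10}ABC Enterprises(\\s[\\w[:punct:]]+){0,10}"

context_punct <- str_extract_all(blog_post, pattern_punct)[[1]]
print(context_punct)
# [1] "week I ordered a new laptop. I bought it from ABC Enterprises and the delivery was quick! I will order again."
```

There are eleven words before the company name in this sample, so the match starts at "week" (ten words before) and runs to the end of the text (nine words after).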
4. Let's practice!
Alright, let's apply this to a real dataset. Let's practice.