tidyr's extract
1. tidyr's extract
Alright, at this point you've become quite skilled at writing regular expressions.
2. Functions used so far
So far, all the functions and workflows we've looked at work great if your text lives in a vector or a list. But what if it doesn't come in that format? Very often, the data we work with is stored in a data frame. Wouldn't it be great if we also had tools to use regular expressions within our data frame workflow?
3. Where regular expressions and data frames meet:
This is exactly where the "extract" function of the "tidyr" package shines. It extracts multiple parts from a text in one column of a data frame.
4. The arguments of extract
Let's look at the function definition of extract and its parameters: "data" is our data frame. "col" is the column where our plain text is stored. With "into" we define the names of the new columns that we create. "regex" is the regular expression that matches the different parts within the text. "remove" defines whether we want to keep the original text column or have it removed. And last but not least: when set to TRUE, the "convert" option will make an educated guess about the data types of the data that we extract. If a group contains only numbers, it will convert it accordingly.
5. Movies data frame
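Before we turn to the movies data, here is a minimal sketch of how the arguments of extract fit together. The data frame "df", its values, and the column names "fruit" and "count" are made up purely for illustration; only the argument names come from tidyr.

```r
library(tidyr)

# Hypothetical toy data: one plain-text column named "line" (values invented)
df <- data.frame(
  line = c("apple 12", "pear 7"),
  stringsAsFactors = FALSE
)

result <- extract(
  data    = df,                    # the data frame
  col     = line,                  # the column holding the plain text
  into    = c("fruit", "count"),   # names of the new columns
  regex   = "([a-z]+) (\\d+)",     # one capturing group per new column
  remove  = TRUE,                  # drop the original "line" column
  convert = TRUE                   # guess types: "count" becomes an integer
)

print(result)
```

With convert left at its default of FALSE, "count" would stay a character column.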
Here we have a data frame that contains information about the number of screens a movie ran on in Switzerland. But the data is not tabular, just plain text in the column "line". Let's imagine we want to do an analysis of these numbers and there is no way to get it in a structured, tabular format.
6. What we can do with str_match
We can get pretty far with what you've learned so far. For example, if we want to identify films that were shown in 3D, we can find them with "str_match" by matching "3D".
7. What the result of str_match looks like
With dplyr's "mutate" we save the result of the function call into a new column "is_3d" in the data frame. Movies in 3D will have 3D in the cell; all other movies will have NA.
8. str_match can only match one piece of information
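The str_match step we just described could look roughly like the sketch below. The two rows in "line" are invented stand-ins, since the real movies data is not reproduced here, and taking column 1 of the match matrix is one possible way to get a plain character column.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the movies data: the "line" values are invented
screens_per_movie <- tibble(
  line = c(
    "Space Quest 3D: 45 screens",
    "Alpine Story: 12 screens"
  )
)

# str_match() returns a character matrix; column 1 holds the full match,
# or NA when the pattern does not occur in the string
screens_per_movie <- screens_per_movie %>%
  mutate(is_3d = str_match(line, "3D")[, 1])

print(screens_per_movie)
```

This gives us exactly one new column, which is where this approach runs out of steam.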
But what if we want to save not only the information about 3D but also the number of screens these movies were shown on? Then we need to extract two things.
9. This is what extract can do for us
tidyr's extract function will help us do just that. Using regular expressions and capturing groups, we can convert text data into tabular data - into a data frame.
10. This is what extract can do for us
Let's look at the code that we use for this task: We pass to "extract" our data frame "screens_per_movie" as the first argument and the name of the text column as the second argument. The third argument is the names of the new columns that we want to create, in our example: "is_3d" and "screens". And lastly we define the regular expression "regex" with two "capturing groups", defined by parentheses. The extract function will work only if the number of capturing groups matches the number of columns we want to create.
11. The result of extract
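A minimal sketch of such a call is shown here. The rows in "line" and the exact regular expression are assumptions made for illustration; the original data and pattern are not reproduced in this transcript.

```r
library(tidyr)

# Hypothetical stand-in for the movies data: the "line" values are invented
screens_per_movie <- data.frame(
  line = c(
    "Space Quest 3D: 45 screens",
    "Alpine Story: 12 screens"
  ),
  stringsAsFactors = FALSE
)

result <- extract(
  screens_per_movie,
  line,
  into   = c("is_3d", "screens"),  # two new columns ...
  regex  = "(3D).*?(\\d+)",        # ... so two capturing groups (assumed pattern)
  remove = FALSE                   # keep the original "line" column for checking
)

print(result)
```

Note that this assumed pattern requires "3D" to be present. For a row where the regex does not match at all, such as the second one, both new columns are filled with NA.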
In our case we need two capturing groups: one for "is_3d" and one for "screens". We also pass "remove = FALSE". For now we want to keep our original text column to check whether the extraction worked as expected. Later on we might delete it to prevent our data frame from containing duplicate information.
12. Let's practice!
Still sounds a bit complicated? By the end of this lesson, you will have mastered all the different options. Let's practice!