1. Finding our perfect match
In this lesson, we'll explore ways to find column names with patterns that aren't necessarily at the beginning or the end.
2. Take a glimpse at world_bank_data
First, take a peek at the world_bank_data to review its column names.
Recall that the glimpse function shows the number of rows and columns, along with each column's name, type, and first few values.
3. One way to select
Suppose we'd like to work with the country, year, infant_mortality_rate, fertility_rate, and unemployment_rate columns. We could select them by separating them with commas as shown.
We can see in the output that these five columns are returned in the order specified. But is there a simpler way? Is there some kind of pattern that these five column names have in common?
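As a rough sketch of the comma-separated approach, the snippet below builds a small stand-in for world_bank_data, since the real data isn't reproduced in this transcript. The name toy_data, the placeholder zeros, and the four perc_ column names are assumptions made for illustration; only the other eight column names come from the lesson.

```r
library(dplyr)

# Assumed column layout: the four perc_* names are placeholders, not confirmed by the lesson
col_names <- c("iso", "country", "continent", "region", "year",
               "infant_mortality_rate", "fertility_rate",
               "perc_electric_access", "perc_college_complete",
               "perc_rural_pop", "perc_cvd_crd_70", "unemployment_rate")

# Stand-in data frame filled with placeholder zeros (not real World Bank values)
toy_data <- as.data.frame(matrix(0, nrow = 2, ncol = length(col_names),
                                 dimnames = list(NULL, col_names)))

# Select the five columns by listing them, separated by commas
selected <- toy_data %>%
  select(country, year, infant_mortality_rate, fertility_rate, unemployment_rate)

names(selected)
```

The columns come back in the order they are listed inside select.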
4. Another way using contains()
One pattern that these five column names have is that they all include the letter y. In fact, of the 12 columns in world_bank_data, these are the only columns that have a y in their name.
If we'd like to search for a literal string appearing in the column names of our data, we can use the contains function. Like the starts_with and ends_with functions, we pass a string to look for as an argument. Here, we want column names that contain the letter y, so we pass the string "y" to contains.
The results here are exactly the same as before when we specified each of the five columns directly, but with much less code.
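A minimal sketch of the contains approach might look like the following. It uses a stand-in data frame (toy_data, an assumed name) with placeholder zeros and assumed perc_ column names, because the actual world_bank_data isn't included here.

```r
library(dplyr)

# Assumed column layout: the four perc_* names are placeholders, not confirmed by the lesson
col_names <- c("iso", "country", "continent", "region", "year",
               "infant_mortality_rate", "fertility_rate",
               "perc_electric_access", "perc_college_complete",
               "perc_rural_pop", "perc_cvd_crd_70", "unemployment_rate")

# Stand-in data frame filled with placeholder zeros (not real World Bank values)
toy_data <- as.data.frame(matrix(0, nrow = 2, ncol = length(col_names),
                                 dimnames = list(NULL, col_names)))

# Keep every column whose name contains the literal string "y"
with_y <- toy_data %>%
  select(contains("y"))

names(with_y)
```

Note that contains returns the matching columns in the order they appear in the data, which in this stand-in happens to match the order used when listing them by name.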
5. A dip into regular expressions
Regular expressions are a language in themselves in some ways. They contain particular characters that act as tokens. These tokens hold special properties that assist in searching for patterns. We'll work with three different tokens.
The first token is the vertical pipe, which corresponds to "or". If we want to find matches that contain either one string or another, we can separate them with this vertical pipe. We might have used the vertical pipe in R before in a similar one-or-the-other situation.
The next regular expression token is the start of string anchor. This is the caret character, and anything that follows the caret is the string we want to find at the beginning.
The third token is the end of string anchor, the dollar sign character. The dollar sign looks for a string appearing at the end. Regular expressions aren't specific to R; as I mentioned, they are a language of their own.
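To see the three tokens on their own before using them with dplyr, here is a small sketch using base R's grepl on a plain character vector. The vector some_names is a hypothetical name holding a few of the column names mentioned in this lesson, and the patterns are chosen only for illustration.

```r
# A few column names from the lesson, stored as a plain character vector
some_names <- c("iso", "continent", "region", "country", "year")

# Vertical pipe: names that contain "iso" OR "year"
either <- some_names[grepl("iso|year", some_names)]

# Caret: names that START with "re"
starts_re <- some_names[grepl("^re", some_names)]

# Dollar sign: names that END with "ry"
ends_ry <- some_names[grepl("ry$", some_names)]

either     # "iso" "year"
starts_re  # "region"
ends_ry    # "country"
```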
6. matches()
Earlier, we used the contains function to search for a literal string in column names. To work with regular expressions, we must use the matches function instead.
Suppose we wanted columns that contained the letter y or included the string perc. Then we just need to separate the y and perc by a vertical pipe.
Nine columns are returned, corresponding to the nine column names that include either a y or perc. Thus, iso, continent, and region are the three of the twelve columns that are not selected here.
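The following sketch shows how matches with a vertical pipe could be written. As before, toy_data is an assumed stand-in with placeholder zeros, and the four perc_ column names are assumptions rather than names confirmed by the lesson.

```r
library(dplyr)

# Assumed column layout: the four perc_* names are placeholders, not confirmed by the lesson
col_names <- c("iso", "country", "continent", "region", "year",
               "infant_mortality_rate", "fertility_rate",
               "perc_electric_access", "perc_college_complete",
               "perc_rural_pop", "perc_cvd_crd_70", "unemployment_rate")

# Stand-in data frame filled with placeholder zeros (not real World Bank values)
toy_data <- as.data.frame(matrix(0, nrow = 2, ncol = length(col_names),
                                 dimnames = list(NULL, col_names)))

# Regular expression: column names containing "y" OR "perc"
y_or_perc <- toy_data %>%
  select(matches("y|perc"))

ncol(y_or_perc)                         # 9
setdiff(col_names, names(y_or_perc))    # "iso" "continent" "region"
```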
7. Alternatives to starts_with() and ends_with()
We've already worked with the starts_with and ends_with functions. We can also use the caret and dollar sign regular expression tokens to find matches at the beginning or end of a column name.
The country and continent variables are important non-numerical variables in our data. We can focus on just those two columns using the caret character before the co string since those are the only column names that start with co.
Or maybe we want to focus on the column names that end in "on", which is only region in this case. We can pass matches the string "on" followed by a dollar sign.
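Both anchors can be sketched as follows, again on an assumed stand-in (toy_data) with placeholder zeros and assumed perc_ column names.

```r
library(dplyr)

# Assumed column layout: the four perc_* names are placeholders, not confirmed by the lesson
col_names <- c("iso", "country", "continent", "region", "year",
               "infant_mortality_rate", "fertility_rate",
               "perc_electric_access", "perc_college_complete",
               "perc_rural_pop", "perc_cvd_crd_70", "unemployment_rate")

# Stand-in data frame filled with placeholder zeros (not real World Bank values)
toy_data <- as.data.frame(matrix(0, nrow = 2, ncol = length(col_names),
                                 dimnames = list(NULL, col_names)))

# Caret anchor: column names that start with "co"
starts_co <- toy_data %>%
  select(matches("^co"))

# Dollar sign anchor: column names that end with "on"
ends_on <- toy_data %>%
  select(matches("on$"))

names(starts_co)  # "country" "continent"
names(ends_on)    # "region"
```

For these two patterns, select(starts_with("co")) and select(ends_with("on")) would pick the same columns; the regular expression form becomes more useful once it is combined with other tokens such as the vertical pipe.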
8. Let's practice!
Check out some matching on the imf_data in the exercises.