Regular Expressions

1. Regular Expressions

Regular expressions are a possibly dreaded yet very important aspect of R. But what is a regular expression?

2. Regular Expressions

Well, it's nothing more than a sequence of characters and metacharacters that form a search pattern, which you can use to match strings. You can use a regular expression to check whether certain patterns exist in a text, to replace these patterns with other elements, or to extract certain patterns from a string. Regexes are particularly handy when you want to clean your data. You'll often turn to regular expressions to make your data ready for further analysis, especially when you're working with data from the web or from different sources. A comprehensive discussion of regular expressions could be a DataCamp course by itself, so I won't go into too much detail here. First, I'll talk about the grepl() and grep() functions. Next, I'll move on to the sub() and gsub() functions.

3. grepl()

Have a look at this vector of character strings that represent some animals. With the grepl() function, we can determine, for example, which of these animals have an "a" in their name. The first argument of grepl() is pattern, the regular expression to look for, while the second is x, the character vector where matches are sought. In our case, the pattern is "a", because we want to find the animals that have an "a" in their name, and x is equal to animals, the vector of animal names. The result makes sense. There is an "a" in "cat", so a TRUE value signals that this pattern was found. In "moose", on the other hand, there is no "a", so the corresponding element is FALSE. Matching simply for "a" is great, but we can do much more with regular expressions.
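Here is a minimal sketch of that call. The slide's vector isn't reproduced in this transcript, so the animals vector below is an assumption: it contains only the four names mentioned in the narration, in an order that puts "ant" at position 4.

```r
# Assumed vector: only the animals named in the transcript
animals <- c("cat", "moose", "impala", "ant")

# One logical per element: TRUE where the name contains an "a"
has_a <- grepl(pattern = "a", x = animals)
print(has_a)
# TRUE FALSE TRUE TRUE
```

The result has the same length as animals, so you can use it directly to subset, as in animals[has_a].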

4. grepl()

What if we want to match strings that start with an "a"? We can use the caret metacharacter here. If we change our pattern from "a" to "^a", a caret followed by an a, we see that only "ant" is matched, because it's the only name in the animals vector that begins with an "a". Just as the caret matches the empty string at the beginning of a line, the dollar sign matches the empty string at the end of a line. So if we want to match animals that end with an "a", we can use the pattern "a$", an a followed by a dollar sign. This time, only "impala" is matched. There are many other metacharacters that I will not discuss here. If you want to learn more about them, you can check out the documentation on regular expressions in R by typing ?regex in the console.
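As a quick sketch of both anchors, using the same assumed four-element animals vector (the slide's actual vector isn't shown in this transcript):

```r
# Assumed vector: only the animals named in the transcript
animals <- c("cat", "moose", "impala", "ant")

# "^a" matches an "a" at the start of the string
starts_with_a <- grepl(pattern = "^a", x = animals)
print(starts_with_a)
# FALSE FALSE FALSE TRUE

# "a$" matches an "a" at the end of the string
ends_with_a <- grepl(pattern = "a$", x = animals)
print(ends_with_a)
# FALSE FALSE TRUE FALSE
```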

5. grep()

Apart from the grepl() function, there is also the grep() function. This function returns a vector of indices of the elements of x that yield a match. That's quite different from grepl(). Compare the grepl() command, which gives a vector of logicals, with the grep() command. However different, they are obviously related: grep() simply gives the indices of the TRUE elements that grepl() returns. One way to compare grep() and grepl() is to use the which() function.

6. grep()

Given a logical vector as input, which() returns the indices for which that vector is TRUE. If you try it out on the grepl() result for our animals, you will get a familiar answer: it is precisely the output of the grep() function. Of course, grep() knows how to handle the different types of regular expression patterns just as grepl() does. The pattern that matches strings starting with an "a", combined with the grep() function, returns only 4, the index of "ant" inside animals.
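The relationship between the three functions can be sketched as follows, again assuming a four-element animals vector with "ant" in position 4:

```r
# Assumed vector: only the animals named in the transcript
animals <- c("cat", "moose", "impala", "ant")

# grep() returns the indices of the matching elements
idx_grep <- grep(pattern = "a", x = animals)
print(idx_grep)
# 1 3 4

# which() on the grepl() result gives the same indices
idx_which <- which(grepl(pattern = "a", x = animals))
print(idx_which)
# 1 3 4

# Anchored patterns work with grep() too
idx_start <- grep(pattern = "^a", x = animals)
print(idx_start)
# 4
```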

7. sub(), gsub()

We have now covered some basics on how to check for the existence of patterns inside a vector of character strings. R, however, also provides some functions to directly replace these matches with other strings. I'm talking about the sub() function. It basically takes three arguments: pattern, replacement, and x. Once again, the pattern argument is the regular expression you want to match, and x is the character vector where matches are sought. Finally, the replacement argument is the string that replaces each match. To see how this works, let's see what happens if we set the pattern argument to "a", matching the character "a", and replacement to "o". As before, x is simply equal to animals. As we'd expect, the "cat" string gets converted to "cot", so the "a" is replaced with an "o". In "moose" there were no "a"s, so nothing got replaced. In "impala", however, there are two "a"s, but only the first "a" has been replaced with an "o". How come? Well, that's because the sub() function only looks for the first match in each string, and if it finds one, replaces it with the replacement argument and immediately stops looking. If you want to replace every single match of a pattern in a string with the replacement argument, you should try the gsub() function instead. Now, "impala" gets converted to "impolo", so both "a"s have been replaced.
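To put sub() and gsub() side by side, here is a short sketch on the same assumed animals vector (four names taken from the narration, not the slide's actual data):

```r
# Assumed vector: only the animals named in the transcript
animals <- c("cat", "moose", "impala", "ant")

# sub() replaces only the first match in each string
first_only <- sub(pattern = "a", replacement = "o", x = animals)
print(first_only)
# "cot" "moose" "impola" "ont"

# gsub() replaces every match in each string
all_matches <- gsub(pattern = "a", replacement = "o", x = animals)
print(all_matches)
# "cot" "moose" "impolo" "ont"
```

Note that neither function modifies animals itself; both return a new character vector.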

8. sub(), gsub()

There is one last metacharacter I want to discuss: the vertical bar, or OR metacharacter. Its meaning is quite similar to the OR operator you learned about for combining logicals. You can use it to match any one of several alternatives.

9. sub(), gsub()

Say, for example, you want to replace every "a" or "i" in the animals character vector with an underscore. It is straightforward to use the pattern "a|i", an a and an i separated by a vertical bar, inside gsub(). Now, all "a"s and "i"s have been replaced by an underscore. Of course, you can extend this pattern even further.
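A minimal sketch of the OR pattern, once more on the assumed four-element animals vector. The last call, which adds "o" as a third alternative, is an illustrative extension that isn't part of the original slides.

```r
# Assumed vector: only the animals named in the transcript
animals <- c("cat", "moose", "impala", "ant")

# "a|i" matches either an "a" or an "i"
masked <- gsub(pattern = "a|i", replacement = "_", x = animals)
print(masked)
# "c_t" "moose" "_mp_l_" "_nt"

# Illustrative extension: add "o" as a third alternative
masked_more <- gsub(pattern = "a|i|o", replacement = "_", x = animals)
print(masked_more)
# "c_t" "m__se" "_mp_l_" "_nt"
```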

10. Let's practice!
