Data preparation and regex

1. Data preparation and regex

In the previous lesson, we started cleaning and tidying data. But we have a few more steps before we'll get to plotting.

2. Handling long names

Let's take a look at the names of our response variables. These are going to be the labels for the bars on the final graph. But right now, they're too long. How can we take out that extra text so each entry is just about the behavior? We can return to our handy stringr function, str_remove(), which allows us to remove those words. But there are a few variations of the extraneous text. In some cases the entry starts with "In general", and in others it ends with "on a plane", so a single fixed piece of text won't cover every entry.
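To see why one fixed piece of text is not enough, here is a minimal sketch. The two labels are made up for illustration and are not copied from the dataset; fixed() is used so the question mark is treated as plain text rather than as a pattern.

```r
library(stringr)

# Hypothetical labels, invented for this sketch (not taken from the dataset)
labels <- c(
  "In general, is it rude to bring a baby on a plane?",
  "Is it rude to recline your seat on a plane?"
)

# A literal pattern removes only the exact text it names,
# so the second label comes back unchanged
no_prefix <- str_remove(labels, fixed("In general, "))

# The trailing text needs its own, separate call
no_suffix <- str_remove(no_prefix, fixed(" on a plane?"))
no_suffix
```

Each variation needs its own call here, which is the motivation for describing the extra text as a general pattern instead.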

3. Regex

To solve this problem, we can use a tool called regular expressions, or regex. Regular expressions are a language for describing patterns in strings, and they are supported in most programming languages. They allow us to find instances of general patterns. For example, regex is how a website knows if you've entered a valid email address; it matches the text you entered to a pattern of what an email should look like. Regex can get very complicated; if you want to learn more, check out DataCamp's available courses on string manipulation and regular expressions. Here, we'll focus on just two characters in the language. The first is the dot. A dot matches any single character. That means it will match a number, a letter, a punctuation mark - anything you can type, the dot will match. Let's try it out. If we detect the dot pattern in the string "happy", the answer is true. It is also true that the pattern h followed by a dot is in "happy", because we have h followed by an a, and the dot matches the a. But we do not detect the pattern y followed by a dot, because there is nothing after the y in "happy".
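The three checks on "happy" can be sketched as follows. This assumes the detection is done with stringr's str_detect(); the transcript does not name the function, so treat that choice as an assumption.

```r
library(stringr)

# Assumption: str_detect() is the function doing the "detecting" here
any_char <- str_detect("happy", ".")   # a dot matches any single character
h_then   <- str_detect("happy", "h.")  # "h" followed by one more character ("ha")
y_then   <- str_detect("happy", "y.")  # nothing follows the final "y"

c(any_char, h_then, y_then)
```

The first two results are TRUE and the last is FALSE, matching the walkthrough.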

4. Regex

The second character we'll cover is the asterisk. The asterisk says "match the character before me zero or more times." For example, let's take the string "Statistics is the best". How could we remove everything except the word "best"? We can frame the problem as "removing everything up to the word best." We'll use the str_remove() function and combine the dot and the asterisk: a dot followed by an asterisk matches any character, any number of times. By using regex, we can specify that we want to remove anything up to and including the word "the" followed by a space.
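Here is a minimal sketch of that removal. The exact pattern, ".*the ", is an assumption reconstructed from the description, since the transcript does not show it. The first line is an extra illustration of the asterisk on its own, using a made-up string.

```r
library(stringr)

# The asterisk alone: "a*" matches the run of a's at the start of the string
only_b <- str_remove("aaab", "a*")

# Assumed pattern: ".*" matches any characters, any number of times,
# followed by the literal text "the" and a space
sentence <- "Statistics is the best"
last_word <- str_remove(sentence, ".*the ")
last_word
```

Because ".*" can match zero characters as well, the same pattern would also work on a string that begins directly with "the ".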

5. Let's practice!

Let's try using our new regex skills on the 538 dataset.