1. Cleaning text data
In the last lesson of this chapter, we'll talk about different ways to address dirty text data.
2. What is text data?
Before we dive in, let's take a look at some examples of text data. Names, phone numbers, email addresses, and passwords are all text data.
Text data is very common, but it can be difficult to work with since it can be unstructured.
3. Unstructured data problems
Because text data doesn't usually have a consistent structure, there are a number of problems that we can run into while working with it.
The first is formatting inconsistencies, since there are often multiple ways of formatting the same information. For example, phone numbers can be written in a variety of ways with or without spaces, parentheses, hyphens, and other punctuation, and credit card numbers can be written with or without spaces.
Information inconsistency happens when different data points offer different amounts of information. For example, one phone number could include a country code, while others may not, or one person might fill in a "name" field using their first and last name, while another might use only their first name.
Data entered can also be invalid, such as a phone number with only 4 digits or a zip code that doesn't exist.
4. Customer data
To learn about addressing these problems, we'll look at an example dataset of customers that contains customer names, companies, and credit card numbers. Notice that some rows have spaces in the credit card number while others have hyphens.
Dirty text data like this can interfere with pipelines and processes that rely on this data. For example, sales software might only be able to process credit card numbers that are consistently formatted.
5. Detecting hyphenated credit card numbers
To clean up this text data, we can use functions from the stringr package.
Before we can clean this data, we'll need to find which values need cleaning. This can be done using the str_detect function, which takes in a character vector, in this case the credit_card column of the customers data frame, and the pattern that we want to detect, which is a hyphen. It returns a logical vector indicating whether a hyphen was found in the credit_card value of each row.
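As a minimal sketch of this step, assuming a small made-up customers data frame (the names and card numbers below are hypothetical and only there for illustration), the detection could look something like this:

```r
library(stringr)

# Hypothetical sample data; names and card numbers are invented for illustration.
customers <- data.frame(
  name = c("Ana Ruiz", "Ben Cho", "Cal Dunn"),
  credit_card = c("4024 0071 2408 3465",
                  "4539-8219-2230-7133",
                  "4556 2040 1990 3182"),
  stringsAsFactors = FALSE
)

# One logical value per row: TRUE where the card number contains a hyphen.
has_hyphen <- str_detect(customers$credit_card, "-")
has_hyphen
```

Here only the second row uses hyphens, so it is the only one flagged as TRUE.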
6. Replacing hyphens
Now that we've identified where our issues lie, we can use the str_replace_all function, which takes in the column of text data, the string we want to replace, and the replacement string.
In this example, we want to replace all of the hyphens in the credit_card column with spaces so that all of the credit card numbers have consistent formatting.
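A rough sketch of the replacement, using a hypothetical vector of card numbers in place of the real column:

```r
library(stringr)

# Hypothetical card numbers: one already uses spaces, one uses hyphens.
credit_card <- c("4024 0071 2408 3465", "4539-8219-2230-7133")

# Replace every hyphen with a space; values without hyphens are left unchanged.
spaced <- str_replace_all(credit_card, "-", " ")
spaced
```

Note that str_replace_all replaces every match in each string, whereas str_replace would only replace the first hyphen.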
7. Removing hyphens and spaces
An alternative solution is to remove the hyphens and spaces from the credit card numbers so that they contain digits only. This can be done using the str_remove_all function.
Here, we take the credit_card column, remove all hyphens, and then remove all spaces. We can add this to our data frame using mutate.
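One way this could look, assuming dplyr is available and using a made-up customers table. The new column name credit_card_clean is a hypothetical choice; the cleaned values could just as well overwrite credit_card.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration.
customers <- tibble(
  name = c("Ana Ruiz", "Ben Cho"),
  credit_card = c("4024 0071 2408 3465", "4539-8219-2230-7133")
)

# Remove all hyphens, then all spaces, and store the result in a new column.
customers <- customers %>%
  mutate(
    credit_card_clean = credit_card %>%
      str_remove_all("-") %>%
      str_remove_all(" ")
  )

customers$credit_card_clean
```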
8. Finding invalid credit cards
Now that we've removed all the hyphens and spaces, all of the credit card numbers should have exactly 16 digits.
We can find invalid credit cards using the str_length function, which returns the length of each string in a column. str_length can be used in combination with a filter to find all the customers whose credit_card number does not contain exactly 16 characters.
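A small sketch of that check, again on hypothetical data in which one card number is deliberately too short:

```r
library(dplyr)
library(stringr)

# Hypothetical, already-cleaned card numbers; the second has only 12 characters.
customers <- tibble(
  name = c("Ana Ruiz", "Ben Cho", "Cal Dunn"),
  credit_card = c("4024007124083465", "453982192230", "4556204019903182")
)

# Keep only the rows whose card number is not exactly 16 characters long.
invalid <- customers %>%
  filter(str_length(credit_card) != 16)

invalid
```

Keep in mind that str_length counts characters of any kind, so this check only catches wrong lengths, not non-digit characters.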
9. Removing invalid credit cards
We can remove these invalid numbers from the dataset by filtering for rows that have a credit card with a length of 16.
Now we'll be able to charge customers with ease!
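The removal step described above is the same filter with the condition flipped. A sketch on the same kind of hypothetical data:

```r
library(dplyr)
library(stringr)

# Hypothetical, already-cleaned card numbers; the second has only 12 characters.
customers <- tibble(
  name = c("Ana Ruiz", "Ben Cho", "Cal Dunn"),
  credit_card = c("4024007124083465", "453982192230", "4556204019903182")
)

# Keep only the rows whose card number is exactly 16 characters long.
valid_customers <- customers %>%
  filter(str_length(credit_card) == 16)

valid_customers
```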
10. More complex text problems
To deal with more complex text data problems, regular expressions can be used. A regular expression is a sequence of characters that describes a search pattern, which allows for flexible searching within a string. For example, we could search for all credit cards that have a 4 as their first digit.
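As an illustration of that search, the caret anchors a regular expression to the start of the string, so the pattern "^4" matches values that begin with a 4. The card numbers below are hypothetical:

```r
library(stringr)

# Hypothetical card numbers; the second one starts with a 5.
credit_card <- c("4024007124083465", "5500005555555559", "4556204019903182")

# "^" anchors the match to the start of the string, so this asks:
# does the value begin with the digit 4?
starts_with_4 <- str_detect(credit_card, "^4")
starts_with_4
```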
In regular expressions, there are certain characters that get treated differently. By default, all the stringr functions we learned about interpret their pattern as a regular expression, so when searching for or replacing one of these special characters literally, the fixed function needs to be wrapped around the pattern, like this.
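To make the difference concrete, here is a sketch using a period, which in a regular expression matches any single character. The period-separated card number is a hypothetical format chosen only to demonstrate the point:

```r
library(stringr)

# Hypothetical card number that uses periods as separators.
credit_card <- "4024.0071.2408.3465"

# Without fixed(), "." is a regular expression that matches ANY character,
# so every character is removed and an empty string is left.
as_regex <- str_remove_all(credit_card, ".")

# With fixed(), "." is treated as a literal period, so only periods are removed.
as_literal <- str_remove_all(credit_card, fixed("."))

as_regex
as_literal
```

The first call wipes out the whole string, while the second leaves just the 16 digits, which is the behavior we actually want here.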
We won't discuss regular expressions any further, but check out these courses to learn more about them.
11. Let's practice!
Time to practice cleaning some text data!