1. Manipulating text data
Great job cleaning and preparing your text for analysis. In this lesson, we'll look at a few more advanced techniques for cleaning text data.
2. How long is a string?
So far, we've been using the terms text and string interchangeably, but it's helpful to think of text as a chain of
3. How long is a string?
characters,
4. How long is a string?
symbols,
5. How long is a string?
or spaces, where each has a number that can uniquely identify it: an index. For example, the space in this piece of text is found at index 6. Each part of the text can then be pinpointed using these indexes, and altered if required during the cleaning process.
6. How long is a string?
The LEN function returns the length of a string, which is the same as the final index, eight in this case.
7. Searching for characters
We can search for particular patterns of characters in a string using the SEARCH function.
It takes three arguments: the string to search for, the text to search through, and the index to start searching at, which is one by default - the start of the text.
8. Searching for characters
For example, we can use SEARCH to find the dollar sign in the following text:
we call the function, specifying the character to search for, and the text to search in.
This returns the index seven.
9. Extracting text - LEFT() and RIGHT()
In this lesson, we've seen two functions that return indexes, but what if we want to extract the actual text? This is where the LEFT and RIGHT functions come in.
Both functions take a string and an optional argument for the number of characters to extract, which is one by default.
The LEFT function extracts a specified number of characters starting from the left-hand side of the text.
Similarly, the RIGHT function extracts characters starting from the right-hand side.
10. Bringing it together
The LEN, SEARCH, LEFT, and RIGHT functions work really nicely together, so let's look at an example of combining them.
Consider some location data, where the city and state have been stored in a single column, separated by a comma. Our analysis requires us to extract the state from this location data.
11. Bringing it together
We start with the RIGHT function, as the state we want to extract is to the right of the city.
12. Bringing it together
The first argument of RIGHT is the string to extract from, which is the location.
This is where things get a little tricky, as the length of the state varies for different locations. We need a way to find the number of characters from the right that will allow us to extract any state.
13. Bringing it together
To find the number of characters to pass to the RIGHT function, we need to subtract the total length of the string by the index that the comma is located at, which we can find using SEARCH.
14. Bringing it together
Inserting this subtraction of the total length by the index that the comma is located at completes the formula.
Recall that when we have nested functions, the inside functions are calculated first, so the LEN and SEARCH calculations are performed, then the subtraction, and finally the RIGHT function.
15. Bringing it together
This returns the correct state for the first row,
16. Bringing it together
and when copied downwards, we can see that it handled the longer state names too!
17. Substituting characters
The last common text function we'll discuss is SUBSTITUTE. This takes three required arguments: text to search through, the string to search for, and the string to replace it with.
The optional occurrence_number argument should be specified when the string to search for appears multiple times in the text, to specify which occurrence to substitute.
18. Substituting characters
It's easy to get the SUBSTITUTE arguments mixed up, so specify them as though you've got a flat tire: we start with the car, identify the flat tire, then replace it with the new one.
19. Let's practice!
You've quickly become a cleaning pro! Head on over to the exercises.