Get startedGet started for free

Cleaning text data

1. Cleaning text data

Now that we've gotten to grips with cleaning and preparing date and time data, let's step it up a gear and look at text data.

2. Cleaning text data

To understand why analyzing text data is so difficult, consider the following survey question: who is your favorite DataCamp instructor? Survey answers can be capitalized differently, can be incomplete or unclear, or contain typos or additional whitespace. Our job when cleaning text data is to ensure that equivalent entries are correctly classified and in a format that can be easily processed during analysis.

3. Changing cases - PROPER()

We saw that different capitalizations is a common issue in text data, so we'll begin by looking at functions to change the case of text. Proper case is where the first letter of every word is capitalized, and every other letter is lowercase. This is often the case we'll want to use when storing names and addresses. The PROPER function will convert text into proper case.

4. Changing cases - LOWER()

Likewise, the LOWER function converts text to lowercase,

5. Changing cases - UPPER()

and the UPPER function to upper case.

6. Removing whitespace

Extra whitespace is also a really common issue in text data. It usually falls into three categories: leading space proceeds the text, trailing space follows the text, and repeated space has more than one space between characters. The TRIM function will remove all of this extra space in one go!

7. Combining text data

One of the most popular operations performed on text fields is concatenating, which joins text together in a new cell. For example, it's common to have first names and last names in separate columns, but it can be useful to combine them. CONCATENATE can do just that! Let's try the function out on some first and last names.

8. Combining text data

If we call CONCATENATE with the first name and last name cell references,

9. Combining text data

the result is a full name with no space in the middle. This is because CONCATENATE pastes together the exact contents of the cells, neither of which contain a space.

10. Combining text data

Because CONCATENATE can accept any number of strings, we can fix this by passing a string containing a space between the two references. This will take the contents of A2, paste a space to the end, and finally, add the contents of B2.

11. Combining text data

This produces the desired result,

12. Combining text data

which we can copy to populate the remaining cells.

13. Combining text data - email addresses

Let's combine a little of everything we've learned so far in this video. Imagine we need the email addresses of each astronaut, which we know take the form: first name-dot-last name, both lowercase, at nasa-dot-com.

14. Combining text data - email addresses

We begin our concatenation with A2, the first name, converted to lowercase using LOWER. The LOWER function here is referred to as a nested function, as it is being used inside another function. Nested functions are evaluated from the inside out. So, in this case, A2 is made lowercase before being concatenated.

15. Combining text data - email addresses

Then, we add a string containing the dot with no spaces.

16. Combining text data - email addresses

Next up is the last name, again, converted to lowercase.

17. Combining text data - email addresses

Finally, we can add the at-nasa-dot-com as a string.

18. Combining text data - email addresses

There we have it!

19. Combining text data - email addresses

Copying the formula allows us to extract the emails for each astronaut on the list. With only six names, we could have written the emails manually, but imagine if we had 100,000 names, this is where using CONCATENATE would be a real timesaver!

20. Let's practice!

Time to clean-up some text data!