Text transformation

1. Text transformation

Let’s explore some of the powerful features that Power Query has for manipulating text data. We are going to be working with the AdventureWorks dataset, looking at the customer table: There are 7 columns in this dataset, 5 of which are text columns. First things first, let’s enable our column profile feature and column distribution feature so that we can see if there is anything sneaky going on in our dataset. It looks like there are unique customer names and customer ID’s so that’s good news. If we take a look at the country-region column, there seem to be 3 unique values in the top 1000 rows of the dataset. Let’s enable profiling on the entire dataset so that we are able to get a more holistic view. Now there are 11 distinct Country-Regions in the whole dataset: That’s strange, we’re pretty sure that there should only be 6 different countries. Let’s investigate this further. We’ll use the column profile feature here. Ah! There are several columns where the same country seems to appear as a separate record. I’m going to filter our column to only have these 4 values to see if I can identify why this is happening: I’ll also remove duplicates so that we only get 4 records to compare between. Though they look exactly the same in the table, Power Query sees them as different values, otherwise, they would have been removed in the remove duplicates step. We can find out if there is any trailing or leading whitespace, or other control characters by counting the number of characters, including whitespace, in each record. First, let’s duplicate the column. Then head to extract > length. We can see that the records have different lengths: The correct character length for “United States” should be 13, 12 letters, and one space. So, the culprit here must be trailing and leading whitespace as well as newline characters. Luckily for us, Power Query has two transformations designed to remove these two from our text columns: Trim and Clean. Let’s apply them to the country column: Now our data is in much better shape, with trailing and leading whitespace removed. Let's remove the filter and remove duplicate steps to see the profile of our column after the changes: Perfect, let's move to the State-Province column. If we scroll through the value distribution, we can see that there are two different versions of New South Wales because one of them has a different capitalization: We’ll use the Capitalize First Word transformation to make all data in this column have the same capitalization: We can see that our transformation was successful by comparing the number of records before and after the transformation: Lastly, let's change the Customer column here to have the first name followed by the last name instead of how it is now. To do this, we’ll use Split column, with comma as a delimiter: Now that we’ve split the columns, we can merge them back in the right order by selecting the second column first then holding control and selecting the first column. Then click on the merge columns transformation, selecting Space as a delimiter. Perfect! Our manager is used to seeing the “Customer” suffix attached to customer names to identify them as a customer. Just in case there are any employees that have the same name as a customer. We can do this with the add suffix feature in Power Query: Great! Now you’re ready to clean up and transform some text data yourself! This time, you’ll work on the Resellers table.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.