Data preview features

1. Data preview features

In this demo, we will learn how to use Power Query data preview features to help us diagnose and fix some common errors that appear in data. The dataset we are working with today is a sample of 99 movie titles from the popular IMDB movie rating database. We have 13 columns of various data types in this dataset, but not all of these columns are going to be useful for analysis in their current state. We can use the data preview features to inform us about the state of our columns and help us decide which columns need fixing. Let’s activate the column distribution feature first by going to the View ribbon. This is a handy feature for summarizing the contents of a column, particularly, how many unique, and how many distinct values are in that column. The first thing we notice here is that the “language” column has only 1 distinct value and 0 unique values. Meaning that all of the movies in this sample are in English. Since it doesn’t add much information, we can remove this column, as long as we make sure to inform our report users that this dataset contains only English movies. Second, we can observe that the color column has 4 distinct and 2 unique values. We can clearly see that there are some missing data points in this column, so we should investigate this a little further. Let’s enable the column quality feature. We can see there are 11 missing data points in this column. This is usually a problem, but you can also notice that almost all of the non-empty observations fall under one value. This means there is not a lot we can learn from this column. Once again, we can remove the column and mention it to our audience that we are only analyzing Color movies. We can also use the column distribution feature to detect whether there are duplicate records in our dataset. It is expected that the number of distinct director names is more than the number of unique director names, meaning there are some directors that worked on multiple different movies. However, the movie titles should all consist of unique records, so that every movie is represented exactly once in this sample. Since that is not the case, we should remove duplicates from the movie_title column. Now our dataset has no repeated movie titles, leaving us with 91 unique movies. Next, we’ll dig a little deeper into specific columns to find and fix some other errors. Let’s enable Column Profile. We’ll check on the duration column. We can see that the Minimum and Maximum are quite extreme, there’s a movie that’s almost 11 hours long, as well as a movie that somehow has a duration of negative 50 minutes. So let’s add a filter to our column to remove these extreme values. I will choose to exclude movies that are shorter than 60 minutes, and longer than 4 hours. Now our column statistics look much more realistic. Finally, let’s check the title_year column. We can clearly see that there are some values that are incorrect, as movies weren’t invented until 1895! Here we can make use of the value distribution part of the column profile and replace these values by right-clicking the bars. Then it’s simply a matter of replacing it with the correct value. Although keep in mind it is always better to do replacements in the source file rather than in Power Query, this is because if multiple rows have the same value, they will all be replaced with the value you specified, which will sometimes not be your intent: We’ve now cleaned up the title_year column. It’s now your turn to apply the data preview features to help transform your data in Power Query.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.