Finding missing data

1. Finding missing data

We will walk through how to use both Power BI and Python to find columns with missing values. There is at least one, so we will also perform imputation. We will use the fisheries dataset.

On a new page, create a new Python script visual and drag all of the columns into the Values section. Inside the script editor, 'dataset' is a pandas DataFrame containing the columns we added to the Values section. Let's import the required packages.

We will build two lists - one holding the column names from the dataset and one with the count of missing values in each column. We use the columns attribute on the dataset DataFrame, together with the to_list() method, to create a list of column names. Save it as cols. Next, find the missing values per column by calling the isnull() method on the DataFrame, then sum() to get the total count per column. Save this as num_missing.

Let's combine these two lists into a new DataFrame using the pandas dot DataFrame function. This function takes a dictionary of items: the first key is 'column', paired with the cols list; the second is 'nullCount', paired with num_missing. Save this as nullDF.

Create a bar visualization to show the counts of missing values per column. With the plt dot bar() function, set x to nullDF['column'] and height to nullDF['nullCount']. Show the chart with plt dot show(), which renders the plot in the format Power BI expects. Click the "Run Script" button to render the visual. This shows the single missing value in the fished column - all in one visualization.

For most cases, identifying which columns have missing data is much easier with the Column quality and Column profile features in the Power Query Editor. Let's open that up. First, we want column profiling to happen across the whole dataset. In the bottom-left corner, click on the text "Column profiling based on top 1000 rows" and choose the other option, "Column profiling based on entire data set".
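The Python-visual script described above might look like the following sketch. In Power BI, the `dataset` DataFrame is supplied automatically from the fields in Values; here a small stand-in frame (with made-up entity and fished values) is built so the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; Power BI handles rendering itself
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the 'dataset' DataFrame Power BI passes to the script.
# The real one holds the columns dragged into the Values section.
dataset = pd.DataFrame({
    "entity": ["A", "B", "C", "D"],
    "fished": [100.0, None, 250.0, 300.0],
})

# Column names as a plain Python list.
cols = dataset.columns.to_list()

# Count of missing values per column (a Series indexed by column name).
num_missing = dataset.isnull().sum()

# Combine the names and counts into one DataFrame.
nullDF = pd.DataFrame({"column": cols, "nullCount": num_missing})

# Bar chart: one bar per column, height = number of missing values.
plt.bar(x=nullDF["column"], height=nullDF["nullCount"])
plt.show()
```

With the real fisheries data, the bar for the fished column would show its single missing value while every other column sits at zero.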
In the "View" tab, check the boxes for "Column profile" and "Column quality". Selecting the "fished" column, the Column quality summary appears below the column header. It shows the percentage of Valid, Error, and Empty values - here Empty is less than 1%. Hovering over this value, we can confirm the exact count is one, which matches what we found earlier.

A simple imputation technique is to replace this empty value with the column average, found here in the Column statistics. Do that by right-clicking on the column header and selecting "Replace Values". Enter null for the "Value To Find" and the average, 12742287, for the "Replace With" value. Click OK. We can see the Empty value percentage is now 0%.

Imputation in Python is more flexible, which is helpful when using different imputation techniques. Let's remove this previous step and instead try median imputation in Python. Staying in Power Query, click on the fisheries query, click the "Transform" tab, then "Run Python script".

Compute the fill values for each entity using the groupby() function; the column we want to calculate on is fished. transform() can be a complicated function to comprehend. It applies the function specified, in this case the median, within each group, and returns a result aligned to the rows of the original DataFrame. Since we grouped by entity, each row gets the median for its own entity. Save this as 'fill'. On the fished column, we use the fillna() method with our fill variable and inplace set to True. Click OK.

Click on the yellow Table text, and voila - the value from 2015 that was null is now the median value. Applying the changes, we can see the Python visualization update to show no missing values. Now it's your turn to find missing data and use imputation!
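The groupby-and-transform imputation step can be sketched as below. Again, `dataset` stands in for the DataFrame Power Query hands to the script; the entity names and fished values are invented for illustration.

```python
import pandas as pd

# Stand-in for the 'dataset' DataFrame Power Query passes to the script.
dataset = pd.DataFrame({
    "entity": ["Norway", "Norway", "Norway", "Peru", "Peru"],
    "fished": [10.0, None, 30.0, 100.0, 200.0],
})

# Median of 'fished' within each entity, broadcast back onto the
# original row index - one fill value per row.
fill = dataset.groupby("entity")["fished"].transform("median")

# Replace each missing value with its entity's median.
dataset["fished"].fillna(fill, inplace=True)
```

Because transform() returns a Series aligned to the original rows, fillna() can use it directly: the Norway gap is filled with Norway's median, not the overall median.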

2. Let's practice!