1. Data Cleaning Techniques in Excel Power Query
Let’s go ahead and launch power query and review the drug administrations dataset.
We will want to ensure that the data preview features for columns are all selected, so let’s go ahead and do that.
Scanning through the columns, we can see that there are a couple of columns with errors. Medication administration date and dose administered. We can quickly spot the errors by noticing the column header bar shows red and hashed lines, where the red portion represents the percentage of total rows with errors.
It is important to note that errors like these can often block you from doing further query steps. For instance, say we wanted to just ignore these errors and keep duplicates to check if there is any duplicated data. We notice right away that this DataFormat error pops up to block us from completing this task.
We can back out of that by clicking the “x” next to the ‘kept duplicates’ step we just attempted to execute in our query applied steps.
So, let’s address those columns with errors. Power Query has a few handy options for handling errors. One option is to just delete the error; another is to replace it with another value. Let’s assess what is causing the error before deciding how to proceed.
In the medication administration date column, right-clicking on the column profile and selecting ‘keep errors’ will reveal any rows of data that are impacted. In this case, we can see the date is a data entry error (it should be 2021-01-15).
Note that since this applied step kept only the errors, it actually filtered out majority of data in our dataset. If we close and load from here it would only load what is displayed here.
We can go ahead and replace this with the correct value by backing up out of the kept error applied step and selecting Replace Errors with the correct value. Right away we can see that the column distribution is now visible, and errors are at 0%.
We can do the same thing for dose-administered. In this case, this value “d” is not clear, and we might just wish to delete these errors if we do not have any context about what the values were meant to be. Let’s go ahead and back out of the ‘kept errors’ step and delete those errors. It's important to note that when you select the "remove errors" option, Power Query will delete the entire row of data.
Now that the errors have been resolved let’s go ahead and assess duplicates by clicking the grid icon on the top left corner of the query and selecting keep duplicates.
We can see that there was quite a bit of duplicated data (142 rows total). We can tell by scanning through the rendered data that the ID and contents are duplicated. Since we don’t want to keep the duplicates, we can now go back and remove the duplicates.
Doing a quick scan of “empty” cells from the column quality display, we can also see where we might need to deal with missing values.
Let’s go ahead and check out the “dose_unit” column. We can see there are 2% empty cells, and sorting in ascending order reveals which rows are impacted. Let’s say that we had a specific directive to replace the empty cells with a known value; we can right-click on the empty cells and enter the new value to be replaced. In this case, leaving the Value to Find as blank and Replace With, with “mL”. We can see now that there are no more cells identified as “empty” in this column.
Okay, now it’s your turn!
2. Let's practice!