Get startedGet started for free

Data Cleaning and Preparation

1. Data Cleaning and Preparation

In this chapter, we will tackle cleaning and preparing data for analysis.

2. ETL

We can consider data cleaning and preparation as part of "Transform" in our overall Extract/Transform/Load (or ETL) process.

3. ETL

Data cleaning and preparation is a key part of the ETL process enabled by Excel Power Query. It involves identifying and addressing issues ranging from missing values to duplicates and inconsistencies in the raw data to ensure it is accurate and suitable for analysis.

4. ETL

Some areas we will focus on in this chapter include column formatting, data filtering and sorting, and error handling.

5. ETL

As well as continuing data loading and interplay between the Power Query editor and Excel Workbooks.

6. ETL

We will also introduce Data Preview features, one of Power Query's key enabling features to assist in the Data Cleaning and Preparation process.

7. Importance of clean data

Why is it important to clean and prepare data? First, it ensures accuracy and reliability. Data containing errors, missing values, or duplicates can lead to incorrect analytical conclusions. Data cleaning helps identify and correct these issues. Second, it improves analysis efficiency and effectiveness. Clean data streamlines the analysis process and ensures data is organized and formatted correctly. Data cleaning also enhances consistency and comparability. This ensures that data from different sources or periods are consistent and comparable. Lastly, it fosters integrity. As a data analyst, working with properly cleaned data fosters the trustworthiness of your analysis and enhances the value of the insights derived.

8. Missing Data

Deciding how to handle missing data depends on your specific data analysis goals and the nature of your dataset. When you have missing or blank data, as in the sample table shown here, you must consider how important or relevant the missing data is for your analysis.

9. Missing Data

Secondly, you will also need to consider the volume of missing data. For instance, if missing data represents only a small percentage of your dataset and doesn't significantly impact your analysis, it may be acceptable to delete those rows. In this case, it would mean deleting three rows of data in this example table.

10. Missing Data

Consider the source and whether missing data is due to entry errors. If it's due to systemic errors, you might consider addressing the root cause rather than deleting the data.

11. Missing Data

For instance, you may opt to impute the missing cell values with the overall column average.

12. Outliers and entry errors

Outliers are data points that deviate significantly from the majority of values in the dataset. These could have derived from typos (entry errors) or represent a true value that happens to be on the extreme end of the distribution. How best to handle outliers depends largely on how they affect your analysis. In some cases, outliers may provide valuable insights or indicate specific trends. You might choose to keep them if they are meaningful to your analysis. If you decide to replace outliers, appropriate imputation methods can be used. For example, you can replace outliers with the mean, median, or other statistical techniques. As with handling missing data, the most appropriate method of imputation can differ greatly depending on the business case and overall context.

13. Duplicate Data

Lastly, addressing duplicate data is an important data-cleaning step for the following reasons. Duplicate data can cause inaccuracy in analytics. This distorts numerical summaries, such as counts and averages, resulting in misleading insights. Duplicate data occupies additional storage space, which can be costly, especially in large datasets. Lastly, data integrity can be compromised by introducing inconsistencies and errors. This example shows that the row of data for "item 3" has been entered twice.

14. Duplicate Data

Assuming we knew that the data was entered twice in error, our solution would be to remove the duplicated line by filtering it out.

15. Let's practice!

We reviewed the importance of data cleaning and preparation and some of the issues you might encounter when cleaning your dataset. Thankfully, Power Query has some excellent features that can help easily clean up and prepare our data. Let's practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.