Data Cleaning and Preparation
1. Data Cleaning and Preparation
In this chapter, we will tackle cleaning and preparing data for analysis.2. ETL
We can consider data cleaning and preparation as part of "Transform" in our overall Extract/Transform/Load (or ETL) process.3. ETL
Data cleaning and preparation is a key part of the ETL process enabled by Excel Power Query. It involves identifying and addressing issues ranging from missing values to duplicates and inconsistencies in the raw data to ensure it is accurate and suitable for analysis.4. ETL
Some areas we will focus on in this chapter include column formatting, data filtering and sorting, and error handling.5. ETL
As well as continuing data loading and interplay between the Power Query editor and Excel Workbooks.6. ETL
We will also introduce Data Preview features, one of Power Query's key enabling features to assist in the Data Cleaning and Preparation process.7. Importance of clean data
Why is it important to clean and prepare data? First, it ensures accuracy and reliability. Data containing errors, missing values, or duplicates can lead to incorrect analytical conclusions. Data cleaning helps identify and correct these issues. Second, it improves analysis efficiency and effectiveness. Clean data streamlines the analysis process and ensures data is organized and formatted correctly. Data cleaning also enhances consistency and comparability. This ensures that data from different sources or periods are consistent and comparable. Lastly, it fosters integrity. As a data analyst, working with properly cleaned data fosters the trustworthiness of your analysis and enhances the value of the insights derived.8. Missing Data
Deciding how to handle missing data depends on your specific data analysis goals and the nature of your dataset. When you have missing or blank data, as in the sample table shown here, you must consider how important or relevant the missing data is for your analysis.9. Missing Data
Secondly, you will also need to consider the volume of missing data. For instance, if missing data represents only a small percentage of your dataset and doesn't significantly impact your analysis, it may be acceptable to delete those rows. In this case, it would mean deleting three rows of data in this example table.10. Missing Data
Consider the source and whether missing data is due to entry errors. If it's due to systemic errors, you might consider addressing the root cause rather than deleting the data.11. Missing Data
For instance, you may opt to impute the missing cell values with the overall column average.12. Outliers and entry errors
Outliers are data points that deviate significantly from the majority of values in the dataset. These could have derived from typos (entry errors) or represent a true value that happens to be on the extreme end of the distribution. How best to handle outliers depends largely on how they affect your analysis. In some cases, outliers may provide valuable insights or indicate specific trends. You might choose to keep them if they are meaningful to your analysis. If you decide to replace outliers, appropriate imputation methods can be used. For example, you can replace outliers with the mean, median, or other statistical techniques. As with handling missing data, the most appropriate method of imputation can differ greatly depending on the business case and overall context.13. Duplicate Data
Lastly, addressing duplicate data is an important data-cleaning step for the following reasons. Duplicate data can cause inaccuracy in analytics. This distorts numerical summaries, such as counts and averages, resulting in misleading insights. Duplicate data occupies additional storage space, which can be costly, especially in large datasets. Lastly, data integrity can be compromised by introducing inconsistencies and errors. This example shows that the row of data for "item 3" has been entered twice.14. Duplicate Data
Assuming we knew that the data was entered twice in error, our solution would be to remove the duplicated line by filtering it out.15. Let's practice!
We reviewed the importance of data cleaning and preparation and some of the issues you might encounter when cleaning your dataset. Thankfully, Power Query has some excellent features that can help easily clean up and prepare our data. Let's practice!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.