Data Cleaning
1. Data cleaning
Let's get our hands dirty and do some cleaning on the data.
2. Data cleaning
The raw data you access is rarely completely clean. Sometimes it contains information you don't need and want to filter out. Some values might be missing and have to be handled. You might need to convert the data type of some columns. And finally, string values might contain characters you want to eliminate.
3. Filter columns
Carrying around information that is not needed might be counterproductive. Imagine having a table with a thousand columns, but you are only interested in one or two.
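As a rough illustration of keeping only the columns of interest, here is a minimal sketch in plain Python. It assumes a table is stored as a list of dictionaries, one per row; the sample data and the helper name `select_columns` are hypothetical and only serve the example.

```python
# Hypothetical table: each row is a dict mapping column name -> value.
rows = [
    {"id": 1, "name": "Ada", "city": "London", "age": 36, "notes": "n/a"},
    {"id": 2, "name": "Linus", "city": "Helsinki", "age": 28, "notes": ""},
]


def select_columns(table, wanted):
    """Return a new table that keeps only the columns listed in `wanted`."""
    return [{col: row[col] for col in wanted} for row in table]


# Keep two columns and leave the original table untouched.
slim = select_columns(rows, ["name", "age"])
print(slim)
```

The helper builds a new table instead of modifying the original, so the dropped columns are still available if they turn out to be needed later.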
Filtering the columns lets you focus on what is important and avoids carrying irrelevant information through the analysis.
5. Filter rows
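Row filtering can be sketched in a few lines of plain Python. The example assumes a table stored as a list of dictionaries; the data and the helper name `filter_rows` are hypothetical, and the condition (keeping one city) is just one possible choice.

```python
# Hypothetical table: each row is a dict mapping column name -> value.
rows = [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Linus", "city": "Helsinki", "age": 28},
    {"name": "Grace", "city": "London", "age": 45},
]


def filter_rows(table, predicate):
    """Keep only the rows for which `predicate(row)` is true."""
    return [row for row in table if predicate(row)]


# Keep the rows whose "city" column contains a certain value.
londoners = filter_rows(rows, lambda row: row["city"] == "London")
print([row["name"] for row in londoners])
```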
Similarly, you can isolate the rows that you are interested in, for example, by keeping only the rows that contain certain values.
6. Filter duplicates
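One possible way to spot and drop exact duplicate rows is sketched below in plain Python. It assumes rows are dictionaries with hashable values; the order data and the helper name `drop_duplicates` are hypothetical. The two totals show how a repeated row inflates a sum.

```python
# Hypothetical order table; the third row repeats the first one exactly.
rows = [
    {"order_id": 101, "amount": 20.0},
    {"order_id": 102, "amount": 35.5},
    {"order_id": 101, "amount": 20.0},
]


def drop_duplicates(table):
    """Keep the first occurrence of every row, preserving the original order."""
    seen = set()
    unique = []
    for row in table:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


unique_rows = drop_duplicates(rows)
total_raw = sum(row["amount"] for row in rows)            # counts order 101 twice
total_clean = sum(row["amount"] for row in unique_rows)   # counts each order once
print(total_raw, total_clean)
```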
It is not unusual to find the same information repeated multiple times in a table. For a correct data analysis, it is important to identify those duplicate rows, both to improve data quality and to avoid overestimation, since a repeated row is counted more than once.
7. Handle missing values
In contrast to duplicates, some values might be missing completely. This might be due to a systematic error in the data collection process, or the values might not exist at all. In any case, missing values in a table can complicate the data analysis, so make sure to handle them correctly before you start.
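A minimal sketch of detecting gaps and of four common ways to handle them: reusing the previous row's value, using the column mean, using a fixed value, or dropping the row. It assumes missing values are represented by `None` in a list of dictionaries; the data and all helper names are hypothetical.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value.
readings = [
    {"day": 1, "temp": 20.0},
    {"day": 2, "temp": None},
    {"day": 3, "temp": 24.0},
    {"day": 4, "temp": None},
]


def count_missing(table, column):
    """Count the rows in which `column` has no value."""
    return sum(1 for row in table if row[column] is None)


def fill_previous(table, column):
    """Fill each gap with the closest earlier value (a leading gap stays None)."""
    filled, last = [], None
    for row in table:
        value = row[column] if row[column] is not None else last
        filled.append({**row, column: value})
        last = value
    return filled


def fill_fixed(table, column, value):
    """Fill every gap with the same fixed value."""
    return [{**row, column: value if row[column] is None else row[column]}
            for row in table]


def fill_mean(table, column):
    """Fill every gap with the mean of the values that are present."""
    known = [row[column] for row in table if row[column] is not None]
    return fill_fixed(table, column, mean(known))


def drop_missing(table, column):
    """Remove the rows in which `column` has no value."""
    return [row for row in table if row[column] is not None]


print(count_missing(readings, "temp"))
print([row["temp"] for row in fill_previous(readings, "temp")])
print([row["temp"] for row in fill_mean(readings, "temp")])
```

Which option fits depends on the data: reusing the previous row only makes sense when the rows are ordered meaningfully, and the mean only applies to numeric columns.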
Handling missing values is sometimes similar to filling in a crossword: you can fill the gap by using the surrounding clues, for example the value of the previous row or the mean value of the column. In other cases, you can fill the gap with a fixed value or, if the missing value is crucial for your data analysis, remove the row completely.
9. Data types
Tabular data is organized in rows and columns, and columns always have associated data types to indicate whether the column contains strings, numbers, date and time, Boolean values, and so on. The data type defines the operations that can be performed on the values. You cannot sum two strings, and you cannot insert a whitespace in the middle of a number.
14. Convert Data Types
Therefore, you sometimes need to convert the data type of a column.
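The sketch below illustrates both directions of conversion in plain Python. It assumes a hypothetical table in which prices were read in as strings and zip codes as numbers; `convert_column` is an illustrative helper, not part of any library. Note that Python's `+` joins two strings rather than adding them, which is why the price column has to become numeric before it can be summed.

```python
# Hypothetical table: prices arrived as strings, zip codes as numbers.
rows = [
    {"zip_code": 1234, "price": "19.5"},
    {"zip_code": 501, "price": "5.25"},
]

# On strings, + concatenates the text instead of adding the numbers.
as_text = rows[0]["price"] + rows[1]["price"]


def convert_column(table, column, converter):
    """Return a new table with `converter` applied to every value in `column`."""
    return [{**row, column: converter(row[column])} for row in table]


# String -> number: the prices can now be summed.
numeric = convert_column(rows, "price", float)
total = sum(row["price"] for row in numeric)

# Number -> string: extra characters (leading zeros) can now be added.
padded = convert_column(rows, "zip_code", lambda n: str(n).zfill(5))

print(as_text, total, [row["zip_code"] for row in padded])
```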
Converting a number to a string, for example, lets you add additional characters to it.
17. Clean strings
A string is a sequence of characters, but not all of those characters are always in the right place.
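As a small illustration of string cleaning, the sketch below removes unwanted characters, collapses whitespace, and standardizes capitalization. The sample names, the set of allowed characters, and the helper name `clean_string` are assumptions made for the example; real data usually needs its own rules.

```python
import re

# Hypothetical raw values with stray whitespace, symbols, and mixed case.
raw_names = ["  alice SMITH ", "Bob\tJones", "carol   white!!"]


def clean_string(text):
    """Strip unwanted characters, normalize whitespace, and fix the casing."""
    text = re.sub(r"[^A-Za-z\s]", "", text)    # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text.title()                        # capitalize each word


cleaned = [clean_string(name) for name in raw_names]
print(cleaned)
```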
Sometimes strings are in the wrong format or contain the wrong characters, and they need to be standardized for consistency.
19. Let's practice!
Now that you have seen this overview of possible data-cleaning steps, test your knowledge with a quiz.