
Data Cleaning

1. Data cleaning

Let's get our hands dirty and do some cleaning on the data.

2. Data cleaning

The raw data that you access is rarely completely clean. Sometimes, it contains information you don't need and want to filter out. Some values might be missing and need to be handled. You might need to convert the data type of some columns. And finally, string values might contain characters you want to eliminate.

3. Filter columns

Carrying around information that is not needed might be counterproductive. Imagine having a table with a thousand columns, but you are only interested in one or two.

4. Filter columns

Filtering them will let you focus on what is important, avoiding carrying irrelevant information through the analysis.
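As a minimal sketch of column filtering, the example below uses plain Python, with a table represented as a list of dictionaries. The table contents and the helper name `select_columns` are hypothetical and chosen only for illustration; the course does not prescribe a specific tool.

```python
# Hypothetical table: each row is a dict mapping column name to value.
rows = [
    {"id": 1, "name": "Ana", "city": "Lisbon", "age": 34},
    {"id": 2, "name": "Ben", "city": "Oslo", "age": 29},
]


def select_columns(table, wanted):
    """Return a new table that keeps only the columns listed in `wanted`."""
    return [{col: row[col] for col in wanted} for row in table]


# Keep only the two columns of interest; the original table is unchanged.
slim = select_columns(rows, ["name", "age"])
print(slim)
```

Building a new table instead of deleting keys in place keeps the raw data available in case another column turns out to be needed later.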

5. Filter rows

Similarly, you can isolate the rows that you are interested in, for example, by keeping only the rows that contain certain values.
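Row filtering can be sketched the same way. The example below keeps only the rows whose value in one column belongs to a set of allowed values; the `orders` data and the `filter_rows` helper are assumptions made for illustration.

```python
# Hypothetical orders table, one dict per row.
orders = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "FR", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": 12.0},
]


def filter_rows(table, column, allowed):
    """Keep the rows whose value in `column` is one of the `allowed` values."""
    return [row for row in table if row[column] in allowed]


# Isolate the orders placed in Germany.
german_orders = filter_rows(orders, "country", {"DE"})
print([row["order_id"] for row in german_orders])
```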

6. Filter duplicates

It is common to find the same information repeated multiple times in a table. For a correct data analysis, it is important to identify and remove those duplicated rows. This improves data quality and avoids counting the same record more than once, which would inflate totals and counts.
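The effect of duplicates on a total can be illustrated with a short sketch. It assumes that a duplicate means a row identical in every column and that all values are hashable; the `payments` data and the `drop_duplicates` helper are hypothetical.

```python
def drop_duplicates(table):
    """Return the rows of `table` with exact repeats removed, keeping the first one."""
    seen = set()
    unique = []
    for row in table:
        # A hashable fingerprint of the whole row, independent of key order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical payments table in which the first row appears twice.
payments = [
    {"customer": "Ana", "amount": 10},
    {"customer": "Ben", "amount": 5},
    {"customer": "Ana", "amount": 10},
]

unique_payments = drop_duplicates(payments)
total_raw = sum(p["amount"] for p in payments)
total_clean = sum(p["amount"] for p in unique_payments)
print(total_raw, total_clean)
```

Whether two identical rows are a true duplicate or two genuine events with the same values depends on the data, so it is worth checking before removing them.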

7. Handle missing values

In contrast to duplicated values, some values might be missing completely. This might be due to a systematic error in the data collection process, or the values might not exist at all. In either case, missing values in a table can complicate the data analysis, so make sure to handle them correctly before you start.

8. Handle missing values

Handling missing values is sometimes similar to filling in a crossword: you can fill the gap by using the surrounding clues, for example the value of the previous row or the mean value of the column. In other cases, you can fill the gap with a fixed value or, if the missing value is crucial for your data analysis, remove the row completely.
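The four strategies mentioned here can be sketched on a single column. The example assumes that missing values are represented by `None`; the `temps` data and the helper names are hypothetical.

```python
from statistics import mean

# Hypothetical column of temperature readings with two missing values.
temps = [21.0, None, 23.0, None, 25.0]


def fill_previous(values):
    """Replace each missing value with the closest earlier value."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if nothing came before
        filled.append(v)
        last = v
    return filled


def fill_mean(values):
    """Replace each missing value with the mean of the known values."""
    avg = mean([v for v in values if v is not None])
    return [avg if v is None else v for v in values]


def fill_fixed(values, fixed):
    """Replace each missing value with a fixed value."""
    return [fixed if v is None else v for v in values]


def drop_missing(values):
    """Remove the missing entries altogether."""
    return [v for v in values if v is not None]


print(fill_previous(temps))
print(fill_mean(temps))
print(fill_fixed(temps, 0.0))
print(drop_missing(temps))
```

Each strategy changes the data differently: filling with a fixed value such as 0.0 shifts the column mean, while dropping rows reduces the sample size, so the choice depends on what the analysis needs.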

9. Data types

Tabular data is organized in rows and columns, and columns always have associated data types to indicate whether the column contains

10. Data types

strings,

11. Data types

numbers,

12. Data types

date and time,

13. Data types

Boolean values, and so on. The data type defines the operations that can be performed on the values: you cannot sum two strings, and you cannot insert a space in the middle of a number.

14. Convert Data Types

Therefore, sometimes, you need to convert the data type of a column.

15. Convert Data Types

Converting a number to a string, for example,

16. Convert Data Types

will let you add additional characters to it.
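A small sketch of both directions of conversion, in plain Python. The column values and the `ORD-` prefix are hypothetical; the point is only that the data type decides which operations are available.

```python
# Hypothetical numeric column: order numbers stored as integers.
order_numbers = [7, 42, 105]

# Converting each number to a string makes it possible to add a prefix.
labels = ["ORD-" + str(n) for n in order_numbers]
print(labels)

# Hypothetical column in which prices were read in as strings.
raw_prices = ["19.99", "5", "120.50"]

# Summing the strings directly fails in Python with a TypeError.
try:
    sum(raw_prices)
    summed_as_strings = True
except TypeError:
    summed_as_strings = False

# After converting to float, arithmetic works as expected.
prices = [float(p) for p in raw_prices]
total = sum(prices)
print(round(total, 2))
```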

17. Clean strings

A string is a sequence of characters, but those characters are not always the right ones or in the right place.

18. Clean strings

Sometimes, strings are in the wrong format or contain unwanted characters, and they need to be standardized for consistency.
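As an illustration, the sketch below standardizes several spellings of the same city name. The cleaning rules (keep only ASCII letters, spaces, and hyphens, then use title case) and the `clean_city` helper are assumptions for this example; real data usually needs rules tailored to it.

```python
import re


def clean_city(raw):
    """Standardize a city name: drop unwanted characters, fix spacing and case."""
    text = re.sub(r"[^A-Za-z\s-]", "", raw)  # keep ASCII letters, whitespace, hyphens
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    text = text.strip()                      # remove leading and trailing spaces
    return text.title()                      # consistent capitalization


# Hypothetical raw values that all refer to the same city.
raw_cities = ["  new york ", "NEW   YORK!", "New York"]
cleaned = [clean_city(c) for c in raw_cities]
print(cleaned)
```

Once the values are standardized, operations such as grouping or duplicate detection treat them as the same city.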

19. Let's practice!

Now that you have seen this overview of possible data-cleaning steps, test your knowledge with a quiz.
