Common data problems

1. Common data problems

In the final video of this chapter, we'll discuss some common data problems.

2. Dirty data

Dirty data is incorrect, incomplete, or inconsistent data. It can be caused by human error, technical issues, or selective data collection. Realistically, starting out with some of these problems is usually unavoidable. We cannot ignore dirty data: if it is not resolved, the data may no longer be representative of what we are trying to analyze, which leads to flawed analysis and wrong conclusions. Compare it with a dirty window: if it is very dirty, you cannot see clearly what is on the other side until you clean it.

3. Data errors

Data errors are incorrect or inconsistent values, for example typing errors or dates in the wrong format. They are typically caused by recording errors. They can easily be resolved if the original value or the valid format is known. Otherwise, the data points in question need to be dropped.
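To make the "resolve if the valid format is known, otherwise drop" idea concrete, here is a minimal sketch in Python. The list of known formats and the sample values are assumptions made up for illustration, and the day-first reading of slash-separated dates is also an assumption; real data would need its own list.

```python
from datetime import datetime

# Hypothetical formats we assume occur in the raw data (day-first for "/" and ".").
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def normalize_date(raw):
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None  # format unknown: the value cannot be repaired

# Made-up sample values: three repairable dates and one that is not.
raw_dates = ["2023-01-15", "15/01/2023", "15.01.2023", "Jan 15th"]
cleaned = [normalize_date(d) for d in raw_dates]

# Values that could not be resolved are dropped.
resolved = [d for d in cleaned if d is not None]
```

Returning None instead of guessing keeps the unrepairable values visible, so the decision to drop them stays explicit.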

4. Missing data

We say data is missing if some data points are left blank, for example when you conduct a survey and some respondents do not answer certain questions. Missing data is especially problematic if many data points are missing or if there is an underlying pattern in what is missing, for example if only older adults, or some other specific group, skipped the survey questions. Depending on the severity of the problem, the affected data can be dropped or imputed. Imputation is a technique that lets us statistically estimate what the missing values could be.
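The two options, dropping and imputing, can be sketched as follows. The survey values are made up, and mean imputation is just one simple technique chosen here for illustration. It is not necessarily the right choice when the missing values follow a pattern.

```python
from statistics import mean

# Made-up survey answers; None marks a question the respondent left blank.
ages = [34, None, 29, 41, None, 36]

# How severe is the problem? Share of blank answers.
missing_share = sum(a is None for a in ages) / len(ages)

# Option 1: drop the missing data points.
dropped = [a for a in ages if a is not None]

# Option 2: impute, here by filling each blank with the mean of the observed values.
fill_value = mean(dropped)
imputed = [a if a is not None else fill_value for a in ages]
```

Dropping shrinks the dataset, while imputing keeps its size but adds estimated values, so it is worth checking the share of missing data before picking one.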

5. Data bias

Because we work with real-world data, the data also takes on real-world characteristics. This means that societal bias can show up as data bias. Like extremely dirty data, severely biased data can lead to unrepresentative results. Unfortunately, data bias can be hard to detect and resolve. The best way to counter it is to ensure a solid data collection process and to stay aware of potential bias in our conclusions. Lastly, explainable AI techniques can make possible bias easier to detect during the analysis phase, as they make the output of models more interpretable.

6. Data cleaning

Data cleaning consists of a set of techniques to counter data problems. It is an important preparation step for any analysis, so make sure to allocate the necessary time. Not all data problems are solvable, however, for example when there is no way to know the right value of incorrect data. Even if the data is severely compromised, it is almost always possible to do some kind of analysis. For example, you can perform a descriptive analysis to point out the data problems and use that to improve the data collection process.
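A descriptive analysis of the data problems themselves could look roughly like the sketch below. The records, the field names, the `quality_report` helper, and the 0 to 120 age range are all hypothetical and only serve as an illustration.

```python
# Made-up records with typical problems: blanks and an impossible value.
records = [
    {"age": 34, "country": "NL"},
    {"age": None, "country": "BE"},
    {"age": -5, "country": "BE"},
    {"age": 41, "country": ""},
]

def quality_report(rows):
    """Count blank values (None or empty string) per field."""
    report = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        missing = sum(v is None or v == "" for v in values)
        report[field] = {"missing": missing, "missing_share": missing / len(values)}
    return report

report = quality_report(records)

# Assumed validity rule: ages outside 0..120 are treated as recording errors.
invalid_ages = [r["age"] for r in records
                if r["age"] is not None and not 0 <= r["age"] <= 120]
```

A summary like this does not fix anything by itself, but it shows where the collection process breaks down and which fields need attention first.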

7. Let's practice!

Time to practice!