Data preparation
1. Data preparation
Great job so far! I'm Hadrien, and I will now tell you about data preparation.
2. Data workflow
Data preparation happens after collecting and storing the data.
3. Why prepare data?
Data rarely comes in ready for analysis. Real-life data is messy and dirty. It needs to be cleaned. Skipping this step may lead to errors down the line, produce incorrect results, or throw off your algorithms. You would not use vegetables without cleaning, peeling and dicing them, as your soup would taste weird and no one would eat it. Well, if you don't clean, peel and dice your data, your results will look weird, and no one will use them!
4. Let's start cleaning
Let's take a simple, but dirty, dataset, and clean it together. You may already notice a few issues.
5. Tidy data
One fundamental aspect of cleaning data is "tidiness". Tidy data is a way of presenting a matrix of data, with observations in rows and variables in columns. This is not the case here. Our observations (people) are in columns, and their features are in rows. Let's take care of that. It's easy to do that programmatically with Python or R. They also help with the other cases you're about to see.
6. Tidy data output
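To make the idea concrete, here is a minimal sketch in plain Python of turning a "features on rows" table into a tidy one. The names match the people in this example, but the ages and countries are made up for illustration, and the list-of-lists layout is an assumption about how the raw data might arrive.

```python
# Hypothetical untidy layout: one row per feature, one column per person.
untidy_rows = [
    ["name", "Sara", "Lis", "Hadrien"],
    ["age", "27", "30", "28"],
    ["country", "Belgium", "US", "FR"],
]

# zip(*rows) transposes the table: the first tuple holds the variable
# names, and each following tuple holds one person's values.
header, *people = zip(*untidy_rows)

# Tidy result: one dictionary per observation, keyed by variable name.
tidy = [dict(zip(header, person)) for person in people]

for row in tidy:
    print(row)
```

Each person is now a single row, and each variable can be read as a column across rows.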
The data looks much clearer this way.
7. Remove duplicates
In general, you want to remove duplicates. Python and R make them easy to identify. Here we can see that Lis appears twice.
8. Remove duplicates | output
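One way this could look in plain Python is sketched below. The records are hypothetical, and the sketch assumes that "duplicate" means every field is identical; near-duplicates (typos, different casing) would need extra handling.

```python
# Hypothetical records; the last one repeats Lis exactly.
people = [
    {"name": "Sara", "country": "Belgium"},
    {"name": "Lis", "country": "US"},
    {"name": "Hadrien", "country": "FR"},
    {"name": "Lis", "country": "US"},
]

seen = set()
deduplicated = []
for person in people:
    # Dictionaries are not hashable, so build a hashable key from the fields.
    key = tuple(sorted(person.items()))
    if key not in seen:
        seen.add(key)
        deduplicated.append(person)  # keep only the first occurrence

print(len(people), "rows before,", len(deduplicated), "rows after")
```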
Let's remove the duplicate.
9. Unique ID
What if there's another person called Lis? Then, you want a way to uniquely identify each observation. It can be a combination of features (first name plus last name plus year of birth, for example),
10. Unique ID | output
but the safest way is to assign a unique ID. Sara's ID is now 0, Lis's is 1 and Hadrien's is 2.
11. Homogeneity
Something fishy is going on in the size column. Lis simply can't be that tall (or Hadrien and Sara that small, depending on where you're from). Lis is in the US, so she entered her size in feet. Sara and Hadrien are based in Europe and use the metric system. All variables should use the same standard.
12. Homogeneity | output
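Here is a rough sketch of how mixed units could be brought onto one scale. The size values are made up for illustration. The rule itself, treating anything above 2.5 as feet and dividing by 3.281, is a heuristic that only works because nobody is 2.5 meters tall.

```python
FEET_PER_METER = 3.281       # conversion factor
MAX_PLAUSIBLE_METERS = 2.5   # anything larger is assumed to be in feet

# Hypothetical sizes: Lis's value is in feet, the others in meters.
sizes = {"Sara": 1.68, "Lis": 5.58, "Hadrien": 1.83}

def to_meters(size):
    """Return the size in meters, converting values that look like feet."""
    if size > MAX_PLAUSIBLE_METERS:
        return round(size / FEET_PER_METER, 2)
    return size

sizes_in_meters = {name: to_meters(size) for name, size in sizes.items()}
print(sizes_in_meters)
```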
Programmatically, you can select values above 2.5, which are too large to be meters, and divide them by 3.281 to get the metric value. Here we go.
13. Homogeneity, again
Similarly, countries should follow the same format. The United States and France are abbreviated, but Belgium is written in full. Let's fix that.
14. Homogeneity, again | output
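A small lookup table is one simple way to do this. The sketch below assumes two-letter codes as the target format; the `COUNTRY_CODES` mapping and the `normalize_country` helper are hypothetical names introduced for illustration.

```python
# Assumed mapping from full country names (lowercased) to two-letter codes.
COUNTRY_CODES = {
    "belgium": "BE",
    "france": "FR",
    "united states": "US",
}

def normalize_country(value):
    """Return the two-letter code for a full name; leave other values as is."""
    cleaned = value.strip()
    return COUNTRY_CODES.get(cleaned.lower(), cleaned)

countries = ["Belgium", "US", "FR"]
normalized = [normalize_country(country) for country in countries]
print(normalized)
```

Values that are not in the mapping pass through unchanged, so an unexpected entry stays visible instead of being silently rewritten.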
Looking better already!
15. Data types
Another common issue relates to data types. The tools you use might be able to infer data types for each column, but you'd better make sure they are correct. Here, the Age column is encoded as text. If you try to get the mean, you'll get an error, because the average of two words doesn't make sense. You should change the type of this feature to a numeric one.
16. Data types | output
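The sketch below shows both halves of the problem in plain Python: averaging text fails, and averaging works once the values are converted. The ages are hypothetical, and `average` is a helper defined here for illustration.

```python
# Hypothetical ages stored as text, as they might come out of a file.
ages_as_text = ["27", "30", "28"]

def average(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

try:
    average(ages_as_text)
except TypeError:
    # sum() cannot add strings to its integer starting value.
    print("Cannot average text values")

# Convert each entry to an integer, then the mean is well defined.
ages = [int(age) for age in ages_as_text]
print(average(ages))
```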
Ages are now numbers; you can see the quotes have disappeared.
17. Missing values
Last but not least, missing values. They are common and occur for various reasons: the agent doing the entry was distracted, the person surveyed did not understand the question, or the gap is intentional, for example because an event has not happened yet. There are several ways to deal with missing values. You can substitute the exact value if you have access to the source. Otherwise, you can impute an aggregate value, like the mean, median or max, depending on the situation. You can drop the observation altogether, but each observation you remove means less training data for your model. Or you can keep it as is and ignore it, if your algorithm allows it.
18. Missing values | output
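Two of these options, imputing the mean and dropping the observation, can be sketched as follows. The known ages are hypothetical and chosen so that their mean is 27.5; `None` is assumed to mark the missing entry.

```python
import math
from statistics import mean

# Hypothetical ages; None marks the missing value.
ages = [27, None, 28]

# Option 1: impute with the mean of the known values, rounded up.
known_ages = [age for age in ages if age is not None]
fill_value = math.ceil(mean(known_ages))
imputed = [fill_value if age is None else age for age in ages]

# Option 2: drop the incomplete observation, at the cost of having less data.
dropped = known_ages

print(imputed)
print(dropped)
```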
Here, we take the mean, 27.5, and round it up to get 28, which happens to be the correct value.
19. Let's practice!
Let's check your understanding!