Tidy Data and Messy Data
What exactly marks the difference between tidy data and messy data? It is not only how organized and intuitive the datasets look to our human eyes, but also how easily and efficiently they can be processed by computers. In his seminal paper Tidy Data, Hadley Wickham proposed three standards for tidy data:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
In this course, we'll focus on the first two rules and show you how to use the Python package pandas to deal with datasets that violate them. To get started, execute `messy` in the IPython shell. This dataset, which appears in Wickham's paper, shows the number of people who chose either of two treatments in a hospital. Compare its structure against Wickham's rules. This dataset is messy because it violates rule #2: it combines Treatment A and Treatment B, two distinct observations, in a single row.
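To make the violation concrete, here is a minimal sketch of how such a table could be reshaped with pandas. The DataFrame below is a hypothetical stand-in for the preloaded `messy`: its column names (`name`, `Treatment A`, `Treatment B`) and values are assumptions for illustration and may not match what you see in the shell. `pd.melt` is one way to turn the two treatment columns into rows; it is not necessarily the approach used later in the course.

```python
import pandas as pd

# Hypothetical stand-in for the preloaded `messy` DataFrame.
# Column names and values are illustrative assumptions.
messy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "Treatment A": [None, 16, 3],
    "Treatment B": [2, 11, 1],
})

# Each row of `messy` holds two observations (one per treatment).
# Melting moves the treatment labels out of the column headers and
# into a `treatment` column, so each row holds one observation.
tidy = pd.melt(
    messy,
    id_vars="name",
    value_vars=["Treatment A", "Treatment B"],
    var_name="treatment",
    value_name="result",
)

print(tidy)
```

After melting, every row describes one person-treatment pair, so each variable (`name`, `treatment`, `result`) has its own column and each observation has its own row.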
Now let's look at two more preloaded datasets. Execute `df1` and `df2` in your IPython shell to inspect them; both are featured in DataCamp's Cleaning Data in R course. The former shows the type and number of pets owned by three co-workers, and the latter shows the average BMI in three countries over several years. Which one of these datasets is messy, and why?
This exercise is part of the course Tidy Data in Python Mini-Course.
