What is tidy data?
1. What is tidy data?
Hi there, my name is Jeroen Boeye and I'll be your instructor for this course. By the end of it, you'll be able to reshape almost any dataset into a tidy format, which will save you a lot of headaches during the rest of your analysis. But what is this tidy format?
2. Tidy datasets are all alike
It's a data structure that is easy to manipulate and visualize. It was defined by Hadley Wickham, who happens to be the godfather of the tidyverse collection of R packages. He saw that everybody working with messy data was losing a lot of time fixing similar problems in their own way, so he created the tidyr package to help you go from messy to tidy data. The other packages in the tidyverse are designed around the idea that your data is in a tidy format. They work so well because tidy datasets are all alike, while every messy dataset is messy in its own way.
3. Rectangular data
The tidy data format has a rectangular shape, which means it has columns, rows, and cells, just like a spreadsheet. However, for rectangular data to be tidy, it has to meet three conditions.
4. Tidy data, variables
Each column should hold a single variable. In this example, these are the names, homeworlds, and species of different Star Wars characters.
5. Tidy data, observations
Each row should hold a single observation. In this example, each movie character is an observation.
6. Tidy data, values
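As a rough illustration of these conditions, here is a minimal sketch of a small tidy table. The three rows are hand-typed stand-ins chosen for this example, not the data shown on the course slide, and the object name characters is an assumption.

```r
library(tibble)

# Hand-typed illustrative rows (not the course's actual slide data)
characters <- tribble(
  ~name,            ~homeworld, ~species,
  "Luke Skywalker", "Tatooine", "Human",
  "C-3PO",          "Tatooine", "Droid",
  "Leia Organa",    "Alderaan", "Human"
)

# One column per variable, one row per character, one value per cell
print(characters)
```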
If you follow the first two rules, each cell should now hold a single value.
7. dplyr recap
Once your data is in a tidy format, you can further process it with the dplyr package, which you should have some experience with before starting this course. Let's do a quick recap of the main dplyr functions.
8. dplyr recap: select()
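A minimal sketch of select(), assuming dplyr's bundled starwars dataset as a stand-in (the transcript does not say which data the slide uses). The result name characters is hypothetical.

```r
library(dplyr)

# Keep only three of the columns (variables); all rows are retained
characters <- starwars %>%
  select(name, homeworld, species)

print(characters)
```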
You can use the select() function to select a subset of the columns, or variables, in your dataset.
9. dplyr recap: filter()
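A minimal sketch of filter(), again assuming dplyr's bundled starwars dataset as a stand-in for the slide's data. The condition on species is only an example.

```r
library(dplyr)

# Keep only the rows (observations) whose species is "Droid";
# rows where species is NA are dropped as well
droids <- starwars %>%
  filter(species == "Droid")

print(droids)
```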
You can use filter() to subset rows, or observations, based on the values of specific variables.
10. dplyr recap: mutate()
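A minimal sketch of mutate(), assuming the bundled starwars dataset, whose height column is recorded in centimeters. The new column name height_m is hypothetical.

```r
library(dplyr)

# Add a new variable: height converted from centimeters to meters
with_height_m <- starwars %>%
  mutate(height_m = height / 100)

print(select(with_height_m, name, height, height_m))
```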
The mutate() function will let you create new variables or overwrite existing ones.
11. dplyr recap: group_by() and summarize()
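A minimal sketch of group_by() followed by summarize(), assuming the bundled starwars dataset. The grouping variable and the two aggregations are chosen for illustration only.

```r
library(dplyr)

# Group by species, then collapse each group to a single row
species_summary <- starwars %>%
  group_by(species) %>%
  summarize(
    n_characters = n(),                        # rows in each group
    mean_height  = mean(height, na.rm = TRUE)  # average height, ignoring NAs
  )

print(species_summary)
```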
And finally, the group_by() function will allow you to specify variables by which you want to group the dataset. You can then use the summarize() function to apply aggregations on each group. After summarizing, you'll have only one observation for each group.
12. Piping it all together
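A sketch of how the recapped verbs might be chained with the pipe operator, assuming the bundled starwars dataset. The specific steps and column names are illustrative, not the course's own pipeline.

```r
library(dplyr)

species_heights <- starwars %>%
  select(name, species, height) %>%                 # keep the relevant variables
  filter(!is.na(species), !is.na(height)) %>%       # drop incomplete observations
  mutate(height_m = height / 100) %>%               # centimeters to meters
  group_by(species) %>%                             # one group per species
  summarize(mean_height_m = mean(height_m), n_characters = n())

print(species_heights)
```

Each step takes the output of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations happen.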
While each of these dplyr operations is simple, you can use them to build up complex analyses one step at a time. To work in this stepwise manner, the pipe operator comes in very handy.
13. dplyr + tidyr
The dplyr package works best on data that's already in a tidy format, but unfortunately, most data in the wild is messy. The tidyr package will help you deal with this. You'll often use dplyr and tidyr functions in a single pipeline, as the two packages complement each other well.
14. Multiple variables in a single column
Let's look at an example of what tidyr can do. The dataset shown here holds the populations of four countries, in millions of people. You'll notice that the country column actually combines two variables, country and continent. This is not a tidy format, since each column should hold just one variable.
15. Separating variables over two columns
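To make the example concrete, here is a minimal sketch that rebuilds a dataset of this shape and splits the combined column with tidyr's separate(). The four rows and their population figures are rough illustrative placeholders, not the course's actual data, and the object names are assumptions.

```r
library(tidyr)

# Placeholder data: the country column mixes country and continent
population <- tibble::tribble(
  ~country,                ~population,
  "Brazil, South America", 210,
  "Belgium, Europe",       11,
  "Japan, Asia",           126,
  "Kenya, Africa",         51
)

# Split the country column in two at each ", "
tidy_population <- separate(
  population,
  country,
  into = c("country", "continent"),
  sep = ", "
)

print(tidy_population)
```

Note that sep is interpreted as a regular expression when it is a string. In tidyr 1.3.0 and later, separate() is marked as superseded in favor of separate_wider_delim(), but it still works as sketched here.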
We can fix this with tidyr's separate() function. We pass it the column we want to split, country in this case, and give the into argument a vector with the new column names. Finally, we pass the sep argument the string to separate by, a comma followed by a space in this case. The result is a tidy dataset, ready for further analysis.
16. Let's practice!
Now it's your turn, let's practice!