Columns with multiple values
1. Columns with multiple values
A common issue with messy data is having multiple values stored in a single column.2. Two variables in a single column
We already saw examples of this in the previous lesson, when two variables were stored in a single column. In the Netflix data, the issue was in the duration column which held both the value and unit of a movie or TV show duration.3. Converting separated columns' data types
the solution looked like this. Setting the convert parameter to TRUE caused the function to check if it could convert the newly created columns to a numeric data type. It found that the value column actually contains integers instead of strings.4. dplyr aggregation recap
Once the data is numeric we can start analyzing it. For example, by grouping on the type of media and unit of duration, we can learn the average duration of both movies and TV shows.5. Separating variables over columns
The initial messy situation of the dataset could be visualized like this. And the tidy version after separating the duration column like so.6. Combining multiple columns into one
Separating a messy column is a fairly common operation, but there are situations where you want to do the opposite and combine multiple columns into one. Take for example this Star Wars data sample with given names and family names for 4 characters. What if we want just a single name column?7. Combining multiple columns into one
This can be done with tidyr's unite() function. We pass it the new column name as a string, "name" in this case, and then pass the columns we want to unite. The result is a tibble with just one column. Note that the unite() function added an underscore as a separator between the two inputs.8. Combining multiple columns into one
We can overwrite this default by specifying the sep argument. We'll set it to a whitespace here.9. Multiple values in a single cell
Let's switch back to a use case where we want to disentangle a messy column. In the data sample shown here, we have the name of a drink in the first column and its ingredients in the second. This second column is what makes this data untidy. Just like in the Netflix data, it has multiple values per cell. But something is different. Let's visualize this data to show this more clearly.10. Multiple values in a single cell
Like the Netflix movie duration column, the ingredients column has multiple values in a single cell. Only this time, the different values don't correspond to different variables.11. Multiple values in a single cell
like they did in the Netflix data.12. Multiple values in a single cell
Instead, they are different observations of ingredients for a certain drink and should be put in different rows. Note that this allows us to deal with different numbers of ingredients per drink elegantly.13. Separating values over rows
We can achieve this result with the separate_rows() function. We pass it the messy column name and a string to use as a separator. In this case, a comma followed by a space. Now that our data is in a tidy format we can easily use it for further analysis.14. Counting ingredients
We can count the number of ingredients per drink by calling dplyr's count() function on the drink column. Or we can pass it the ingredients column to count how often ingredients are used over all drinks.15. Visualizing ingredients
It also becomes straightforward to visualize the ingredients per drink with ggplot2.16. Let's practice!
Now it's your turn, let's practice!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.