Get startedGet started for free

Columns with multiple values

1. Columns with multiple values

A common issue with messy data is having multiple values stored in a single column.

2. Two variables in a single column

We already saw examples of this in the previous lesson, when two variables were stored in a single column. In the Netflix data, the issue was in the duration column which held both the value and unit of a movie or TV show duration.

3. Converting separated columns' data types

the solution looked like this. Setting the convert parameter to TRUE caused the function to check if it could convert the newly created columns to a numeric data type. It found that the value column actually contains integers instead of strings.

4. dplyr aggregation recap

Once the data is numeric we can start analyzing it. For example, by grouping on the type of media and unit of duration, we can learn the average duration of both movies and TV shows.

5. Separating variables over columns

The initial messy situation of the dataset could be visualized like this. And the tidy version after separating the duration column like so.

6. Combining multiple columns into one

Separating a messy column is a fairly common operation, but there are situations where you want to do the opposite and combine multiple columns into one. Take for example this Star Wars data sample with given names and family names for 4 characters. What if we want just a single name column?

7. Combining multiple columns into one

This can be done with tidyr's unite() function. We pass it the new column name as a string, "name" in this case, and then pass the columns we want to unite. The result is a tibble with just one column. Note that the unite() function added an underscore as a separator between the two inputs.

8. Combining multiple columns into one

We can overwrite this default by specifying the sep argument. We'll set it to a whitespace here.

9. Multiple values in a single cell

Let's switch back to a use case where we want to disentangle a messy column. In the data sample shown here, we have the name of a drink in the first column and its ingredients in the second. This second column is what makes this data untidy. Just like in the Netflix data, it has multiple values per cell. But something is different. Let's visualize this data to show this more clearly.

10. Multiple values in a single cell

Like the Netflix movie duration column, the ingredients column has multiple values in a single cell. Only this time, the different values don't correspond to different variables.

11. Multiple values in a single cell

like they did in the Netflix data.

12. Multiple values in a single cell

Instead, they are different observations of ingredients for a certain drink and should be put in different rows. Note that this allows us to deal with different numbers of ingredients per drink elegantly.

13. Separating values over rows

We can achieve this result with the separate_rows() function. We pass it the messy column name and a string to use as a separator. In this case, a comma followed by a space. Now that our data is in a tidy format we can easily use it for further analysis.

14. Counting ingredients

We can count the number of ingredients per drink by calling dplyr's count() function on the drink column. Or we can pass it the ingredients column to count how often ingredients are used over all drinks.

15. Visualizing ingredients

It also becomes straightforward to visualize the ingredients per drink with ggplot2.

16. Let's practice!

Now it's your turn, let's practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.