Completing data with all value combinations
1. Completing data with all value combinations
In the previous lesson, we saw how to create a data frame with all unique combinations of two variables. Using a two-step approach, you could join this data back to your original dataset, allowing you to add missing observations or to highlight which observations were missing in the first place. It turns out that for adding missing observations, there is a more elegant solution.
2. Rolling Stones and Beatles
Let's consider this data sample with the number of live albums released by both the Beatles and the Rolling Stones in a short period in the seventies. There is no observation for the Rolling Stones in 1979, nor is there one for either band in 1978. This could cause problems when we plot this data.
3. Initial and target situation
We could visualize the situation like so. We have two discrete or categorical variables, year and artist, and one counted variable, n_albums. We want to get to a situation where there is an observation for all combinations of the first two variables.
4. Initial and target situation
And we might even want to expand to values not yet seen in the data, like the year 1978.
5. The complete() function
This is where tidyr's complete() function comes in. We pass it the variables for which we want each possible value combination to become an observation, year and artist in this case. As a result, the Rolling Stones get an observation in 1979 with an NA value for the number of albums.
6. The complete() function: overwriting NA values
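As a rough sketch of the basic complete() call described above: the data frame name album_df and the album counts are illustrative placeholders, not values taken from the lesson. Only the pattern of missing rows follows the description.

```r
library(tidyr)

# Hypothetical sample data: the n_albums values are placeholders.
# There is no row for the Rolling Stones in 1979.
album_df <- data.frame(
  year     = c(1977, 1977, 1979),
  artist   = c("Beatles", "Rolling Stones", "Beatles"),
  n_albums = c(2, 1, 1)
)

# Turn every combination of year and artist into an observation.
completed <- complete(album_df, year, artist)
print(completed)
```

The added row for the Rolling Stones in 1979 carries NA in n_albums, since complete() has no value to put there.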
We can overwrite these NA values within the complete() function using the fill argument. Just like the replace_na() function, fill expects a named list that maps each variable to the value that should replace its NA values. Here, we set n_albums to zero.
7. The complete() function: adding unseen values
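The fill argument could be used along these lines. This is a minimal sketch with the same hypothetical album_df and placeholder counts as before, not the lesson's actual data.

```r
library(tidyr)

# Hypothetical sample data: the n_albums values are placeholders.
album_df <- data.frame(
  year     = c(1977, 1977, 1979),
  artist   = c("Beatles", "Rolling Stones", "Beatles"),
  n_albums = c(2, 1, 1)
)

# fill takes a named list: NA values in n_albums become 0.
completed <- complete(album_df, year, artist, fill = list(n_albums = 0))
print(completed)
```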
To add values that were not yet seen in the data, all you have to do is set the variable name equal to a vector of all the values you want to complete it with. In this example, we've set artist equal to not just the Beatles and Rolling Stones, but also ABBA. As a result, this band is now added to the output.
8. The complete() function: adding unseen values
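Adding an unseen artist might look like the sketch below, again assuming the hypothetical album_df with placeholder counts.

```r
library(tidyr)

# Hypothetical sample data: the n_albums values are placeholders.
album_df <- data.frame(
  year     = c(1977, 1977, 1979),
  artist   = c("Beatles", "Rolling Stones", "Beatles"),
  n_albums = c(2, 1, 1)
)

# Name the variable and supply every value it should be completed with,
# including "ABBA", which never occurs in the data.
completed <- complete(
  album_df,
  year,
  artist = c("Beatles", "Rolling Stones", "ABBA")
)
print(completed)
```

Each year now has a row for ABBA, with NA for the number of albums.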
We can do the same thing for the year variable. We specify a range of years to complete with, from 1977 to 1979. As a result, the year 1978 is now included too. However, there are downsides to specifying a range of values like this. First, you have to manually inspect the data for the lowest and highest value present. Second, when you reuse your code on an updated or different dataset, you might get unexpected results.
9. Generating a sequence with full_seq()
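The hard-coded year range discussed above could be written as follows. The sketch assumes the same hypothetical album_df with placeholder counts.

```r
library(tidyr)

# Hypothetical sample data: the n_albums values are placeholders.
album_df <- data.frame(
  year     = c(1977, 1977, 1979),
  artist   = c("Beatles", "Rolling Stones", "Beatles"),
  n_albums = c(2, 1, 1)
)

# Manually specified range: 1978 is added even though it is not in the data.
completed <- complete(album_df, year = 1977:1979, artist)
print(completed)
```

The bounds 1977 and 1979 are typed in by hand here, which is exactly the fragility described above: new data outside that range would not extend it.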
This is where the full_seq() function can help. It will look for the lowest and highest value within the vector you pass it, and will then return a sequence between these two values. The interval between the values in the sequence can be set with the period argument, which is set to one in this example. Because it looks for min and max values first, the function is not affected by duplicate values in the input. This allows us to pass it a variable from our data frame, giving us the same range of years that we would have specified manually.
10. Using full_seq() inside complete()
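A small sketch of full_seq() on its own, using an input vector made up for illustration (it contains duplicates and skips 1978):

```r
library(tidyr)

# Hypothetical input: duplicated values, and 1978 is absent.
years_seen <- c(1977, 1979, 1979, 1977)

# full_seq() spans from the minimum to the maximum in steps of `period`.
years_full <- full_seq(years_seen, period = 1)
print(years_full)
```

The result runs from 1977 through 1979 in steps of one, regardless of how often each year appeared in the input.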
We can elegantly plug full_seq() into the complete() function like so, giving us the same result as before, but now more robust to data changes.
11. Generating a date sequence with full_seq()
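Combining the two functions might look like this sketch, again with the hypothetical album_df and placeholder counts:

```r
library(tidyr)

# Hypothetical sample data: the n_albums values are placeholders.
album_df <- data.frame(
  year     = c(1977, 1977, 1979),
  artist   = c("Beatles", "Rolling Stones", "Beatles"),
  n_albums = c(2, 1, 1)
)

# The year range is derived from the data itself instead of being typed in.
completed <- complete(album_df, year = full_seq(year, period = 1), artist)
print(completed)
```

If later data extended beyond 1979, the same call would pick up the new upper bound without any change to the code.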
You can also apply the full_seq() function to dates, as shown in this example. The period argument then corresponds to the number of days between dates in the output.
12. Let's practice!
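For reference, the date case could be sketched as below. The two dates are made up for illustration and do not come from the lesson.

```r
library(tidyr)

# Hypothetical Date vector with a gap of several days.
dates_seen <- as.Date(c("1977-01-01", "1977-01-04"))

# With Date input, period is measured in days.
dates_full <- full_seq(dates_seen, period = 1)
print(dates_full)
```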
That's it for this lesson. Now it's your turn to practice.