Creating unique combinations of vectors

1. Creating unique combinations of vectors

In the first chapter, we learned how to deal with missing values. We had rows, or observations, in our data that did not have a value for one or several variables. But what if entire observations are missing, not just individual values?

2. The early atomic era: 1945 - 1954

Let's look at an example. In this data sample from the Nuclear Explosions Database, we have an overview of the number of nuclear bombs detonated per country in the first decade of the atomic era. The dataset only has observations for countries that actually detonated bombs in a given year. For example, for the Russian Federation, which was the USSR back then, there are no observations for the year 1945. There are several ways to expand the dataset to include these observations using tidyr.

3. The expand_grid() function

Let's first look at the expand_grid() function. The function creates a tibble with all possible combinations of the vectors that you pass it. For the country column, we pass a vector with the Russian Federation, United Kingdom, and United States. These were the only nuclear powers back then. For the year column, we pass a range from 1945 through 1954. The result is a tibble with 3 observations per year, one for each country. This tibble can help us complete the original data sample, so we'll save it as full_df.
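As a minimal sketch of this step, the call below builds the grid of country and year combinations described above. The column names country and year and the name full_df come from the transcript; the exact spelling of the country labels is an assumption.

```r
library(tidyr)

# Every combination of the three countries and the ten years 1945-1954.
# The country labels are assumed to match those used in the original data.
full_df <- expand_grid(
  country = c("Russian Federation", "United Kingdom", "United States"),
  year = 1945:1954
)

full_df
```

With 3 countries and 10 years, the resulting tibble has 30 rows and 2 columns.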

4. right_join() with a tibble of unique combinations

To add the missing observations to our original data sample, we can right-join nuke_df to full_df on the country and year columns, which keeps every row of full_df. For combinations of country and year that are in full_df but not in nuke_df, an observation will be added with an NA value for n_bombs.
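The right join could look like the sketch below. The nuke_df built here is a small stand-in with placeholder counts, not the actual values from the Nuclear Explosions Database, and joined_df is a name chosen only for this illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for nuke_df: placeholder counts for illustration only,
# not the real figures from the Nuclear Explosions Database.
nuke_df <- tibble(
  country = c("United States", "United States", "Russian Federation"),
  year = c(1945L, 1946L, 1949L),
  n_bombs = c(3L, 2L, 1L)
)

# All country/year combinations for the decade.
full_df <- expand_grid(
  country = c("Russian Federation", "United Kingdom", "United States"),
  year = 1945:1954
)

# A right join keeps every row of full_df; combinations that are
# absent from nuke_df get NA in n_bombs.
joined_df <- nuke_df %>%
  right_join(full_df, by = c("country", "year"))

joined_df
```

Because the join keeps all rows of the right-hand table, joined_df has one row per combination in full_df, and only the three combinations present in the stand-in data carry a non-missing n_bombs.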

5. right_join() with a tibble of unique combinations

We can then use the replace_na() function to replace NA values with zeros.
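A sketch of the full pipeline with replace_na() appended follows. As before, nuke_df is a stand-in with placeholder counts rather than the real data, and complete_df is a hypothetical name for the result.

```r
library(dplyr)
library(tidyr)

# Stand-in for nuke_df: placeholder counts for illustration only.
nuke_df <- tibble(
  country = c("United States", "United States", "Russian Federation"),
  year = c(1945L, 1946L, 1949L),
  n_bombs = c(3L, 2L, 1L)
)

full_df <- expand_grid(
  country = c("Russian Federation", "United Kingdom", "United States"),
  year = 1945:1954
)

complete_df <- nuke_df %>%
  right_join(full_df, by = c("country", "year")) %>%
  # replace_na() takes a named list: column name = replacement value.
  # 0L keeps n_bombs an integer column.
  replace_na(list(n_bombs = 0L))

complete_df
```

Passing a named list lets replace_na() target only the n_bombs column and leave the key columns untouched.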

6. anti_join() to select missing observations

Another question you can answer once you've created a data frame with all unique combinations of some variables is: which combinations were missing in the original data frame? To find out, start from the dataset with all combinations, full_df in our case, and perform an anti join with the original dataset, nuke_df, on the country and year columns. This returns all observations in full_df that could not be found in nuke_df. We can see that in 1945 and 1946 the USSR and the UK did not detonate any nukes, and that in 1947 none of the three countries did.
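The anti join can be sketched as follows. The nuke_df here is again a stand-in with placeholder counts, so the rows it returns will not match the real dataset exactly; missing_df is a name used only for this illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for nuke_df: placeholder counts for illustration only.
nuke_df <- tibble(
  country = c("United States", "United States", "Russian Federation"),
  year = c(1945L, 1946L, 1949L),
  n_bombs = c(3L, 2L, 1L)
)

full_df <- expand_grid(
  country = c("Russian Federation", "United Kingdom", "United States"),
  year = 1945:1954
)

# anti_join() keeps the rows of full_df that have no match in nuke_df,
# i.e. the country/year combinations missing from the original data.
missing_df <- full_df %>%
  anti_join(nuke_df, by = c("country", "year"))

missing_df
```

The result contains only the columns of full_df, since an anti join filters rows and never adds columns from the second table.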

7. Let's practice!

Now it's your turn. Let's practice!