Get startedGet started for free

Lines that intersect are without parallel

1. Lines that intersect are without parallel

Now that we've refreshed our knowledge of joins, we'll dive into some set theory clauses in dplyr.

2. Set theory clauses

For the remainder of this chapter, you'll work with different set theory clauses. Set theory clauses will allow us to compare and combine data in two different tibbles. The dplyr package has generalized these clauses to work with tibbles on a row-wise basis.

3. Venn diagrams for set theory

Set theory operations are usually represented using Venn diagrams. We can think of each circle as representing a dataset. The shading represents what is included in the result of the set operation from each dataset. intersect results in only those rows found in both datasets x and y. union includes every row in both datasets, but does not double count those that are in both tibbles. union_all includes every row in both datasets and replicates those that are in both datasets. This is why the center is shaded black. Where intersect is similar to an "and" operator, union and union_all are similar to the "or" operator. setdiff results in only those rows in x but not in y. Let's investigate what intersect looks like as a joining diagram. Note that each of these will only work if the column names are the same for both x and y.

4. intersect diagram

In this diagram, we have two tibbles with names left_col and right_col. The "col" here corresponds to each tibble having a single column. The result of the intersect is only the rows in common to both. This corresponds to the id values of 1 and 4. Let's see how this plays out with our Uruguay datasets from the last lesson.

5. Uruguay tibbles

The IMF Uruguayan data contains the iso code, country name, year, and consumer price index values for each year from 2011 to 2019. The World Bank data for Uruguay also contains the iso code, country name, and year. It differs in that it includes the percentage of rural population and only the years 2013 to 2016. So what happens if we try to perform an intersect on these two tibbles?

6. Trying out intersect()

Entering the datasets into intersect yields, unfortunately, an error. But, the error is helpful and let's you know that columns appear in one tibble but not the other. The intersect function isn't finding any matches here because it matches across entire rows, and the rows in each tibble have different columns. We were able to identify that the years 2013 to 2016 appear in both tibbles. We can also programmatically get to this by isolating on the year vectors in each tibble and sending those into intersect. We get the vector back as a result laying out these four years.

7. Difference between intersect() and a join

So are intersect and inner_join the same thing? They aren't. intersect looks for matches going across the entire row of a tibble. It might be the case that there are no rows in common across two tibbles, even if there are some key entries that match up. A join will only focus on those key entries and return rows that might include missing values, which we saw with left joins, for example. They do similar things, but intersect is matching across the whole row. This will also continue for union, union_all, and setdiff too. Each of these functions, just like intersect, works across the entire row.

8. Let's practice!

Try out the intersect function on some exercises.