Session Ready
Exercise

Anti joins

In the previous exercise, the joined dataset wasn't as big as you might have expected, since not all the artists had tags associated with them. Anti joins are really useful for finding problems with other joins.

An anti join returns the rows of the first table where it cannot find a match in the second table. The principle is shown in this diagram.

An anti join, explained using table of colors.

Anti joins are a type of filtering join, since they return the contents of the first table, but with their rows filtered depending upon the match conditions.

The syntax for an anti join is more or less the same as for a left join: simply swap left_join() for anti_join().

anti_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
Instructions
100 XP

A Spark connection has been created for you as spark_conn. Tibbles attached to the track metadata and artist terms stored in Spark have been pre-defined as track_metadata_tbl and artist_terms_tbl respectively.

  • Use an anti join to join the artist terms to the track metadata by the artist_id column. Assign the result to joined.
  • Use dim() to determine how many rows and columns there are in the joined table.