Anti joins
In the previous exercise, the joined dataset wasn't as big as you might have expected, since not all the artists had tags associated with them. Anti joins are really useful for finding problems with other joins.
An anti join returns the rows of the first table where it cannot find a match in the second table. The principle is shown in this diagram.
Anti joins are a type of filtering join, since they return the contents of the first table, but with their rows filtered depending upon the match conditions.
The syntax for an anti join is more or less the same as for a left join: simply swap left_join()
for anti_join()
.
anti_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
This is a part of the course
“Introduction to Spark with sparklyr in R”
Exercise instructions
A Spark connection has been created for you as spark_conn
. Tibbles attached to the track metadata and artist terms stored in Spark have been pre-defined as track_metadata_tbl
and artist_terms_tbl
respectively.
- Use an anti join to join the artist terms to the track metadata by the
artist_id
column. Assign the result tojoined
. - Use
sdf_dim()
to determine how many rows and columns there are in the joined table.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl
# Anti join artist terms to track metadata by artist_id
joined <- ___
# How many rows and columns are in the joined table?
___
This exercise is part of the course
Introduction to Spark with sparklyr in R
Learn how to run big data analysis using Spark and the sparklyr package in R, and explore Spark MLIb in just 4 hours.
In which you learn more about using the <code>dplyr</code> interface to Spark, including advanced field selection, calculating groupwise statistics, and joining data frames.
Exercise 1: Leveling upExercise 2: Mother's little helper (1)Exercise 3: Mother's little helper (2)Exercise 4: Selecting unique rowsExercise 5: Common peopleExercise 6: Collecting data back from SparkExercise 7: Storing intermediate resultsExercise 8: Groups: great for music, great for dataExercise 9: Groups of mutantsExercise 10: Advanced Selection II: The SQLExercise 11: Left joinsExercise 12: Anti joinsExercise 13: Semi joinsWhat is DataCamp?
Learn the data skills you need online at your own pace—from non-coding essentials to data science and machine learning.