Semi joins
Semi joins are the opposite of anti joins: an anti-anti join, if you like.
A semi join returns the rows of the first table where it can find a match in the second table. The principle is shown in this diagram.
The syntax is the same as for other join types; simply swap the other join function for semi_join()
semi_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
You may have spotted that the results of a semi join plus the results of an anti join give the orignial table. So, regardless of the table contents or how you join them, semi_join(A, B)
plus anti_join(A, B)
will return A
(though maybe with the rows in a different order).
This is a part of the course
“Introduction to Spark with sparklyr in R”
Exercise instructions
A Spark connection has been created for you as spark_conn
. Tibbles attached to the track metadata and artist terms stored in Spark have been pre-defined as track_metadata_tbl
and artist_terms_tbl
respectively.
- Use a semi join to join the artist terms to the track metadata by the
artist_id
column. Assign the result tojoined
. - Use
sdf_dim()
to determine how many rows and columns there are in the joined table.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl
# Semi join artist terms to track metadata by artist_id
joined <- ___
# How many rows and columns are in the joined table?
___