Get Started

Left joins

As well as manipulating single data frames, sparklyr allows you to join two data frames together. A full treatment of how to join tables together using dplyr syntax is given in the Joining Data with dplyr course. For the rest of this chapter, you'll see some examples of how to do this using Spark.

A left join takes all the values from the first table, and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values. The principle is shown in this diagram.

A left join, explained using table of colors.

Left joins are a type of mutating join, since they simply add columns to the first table. To perform a left join with sparklyr, call left_join(), passing two tibbles and a character vector of columns to join on.

left_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))

When you describe this join in words, the table names are reversed. This join would be written as "another_tibble is left joined to a_tibble".

This exercise introduces another Spark DataFrame containing terms that describe each artist. These range from rather general terms, like "pop", to more niche genres such as "swiss hip hop" and "mathgrindcore".

This is a part of the course

“Introduction to Spark with sparklyr in R”

View Course

Exercise instructions

A Spark connection has been created for you as spark_conn. Tibbles attached to the track metadata and artist terms stored in Spark have been pre-defined as track_metadata_tbl and artist_terms_tbl respectively.

  • Use a left join to join the artist terms to the track metadata by the artist_id column.
    • The table to be joined to, track_metadata_tbl, comes first.
    • The table that joins the first, artist_terms_tbl, comes next.
    • Assign the result to joined.
  • Use sdf_dim() to determine how many rows and columns there are in the joined table.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl

# Left join artist terms to track metadata by artist_id
joined <- ___(___, ___, by = ___)

# How many rows and columns are in the joined table?
___
Edit and Run Code

This exercise is part of the course

Introduction to Spark with sparklyr in R

IntermediateSkill Level
5.0+
4 reviews

Learn how to run big data analysis using Spark and the sparklyr package in R, and explore Spark MLIb in just 4 hours.

In which you learn more about using the <code>dplyr</code> interface to Spark, including advanced field selection, calculating groupwise statistics, and joining data frames.

Exercise 1: Leveling upExercise 2: Mother's little helper (1)Exercise 3: Mother's little helper (2)Exercise 4: Selecting unique rowsExercise 5: Common peopleExercise 6: Collecting data back from SparkExercise 7: Storing intermediate resultsExercise 8: Groups: great for music, great for dataExercise 9: Groups of mutantsExercise 10: Advanced Selection II: The SQLExercise 11: Left joins
Exercise 12: Anti joinsExercise 13: Semi joins

What is DataCamp?

Learn the data skills you need online at your own pace—from non-coding essentials to data science and machine learning.

Start Learning for Free