
Copying data into Spark

Before you can do any real work with Spark, you need to get your data into it. sparklyr provides functions such as spark_read_csv() that read a CSV file directly into Spark. More generally, it is useful to be able to copy data from R to Spark, which is done with dplyr's copy_to() function. Be warned: copying data is a fundamentally slow process. In fact, much of the strategy for optimizing performance with big datasets comes down to avoiding copying data from one location to another.
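For example, here is a minimal sketch of reading a CSV file straight into Spark; the file path tracks.csv and the table name tracks are hypothetical placeholders:

library(sparklyr)

# Connect to a local Spark cluster
spark_conn <- spark_connect(master = "local")

# Read a CSV file directly into Spark as a table named "tracks";
# the path is a placeholder for illustration
tracks_tbl <- spark_read_csv(spark_conn, name = "tracks", path = "tracks.csv")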

copy_to() takes two arguments: a Spark connection (dest), and a data frame (df) to copy over to Spark.
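Continuing the sketch above, a minimal copy_to() call looks like this; track_metadata is the data frame you will meet later in this exercise:

# Copy a local R data frame to Spark; the connection (dest) comes
# first, then the data frame (df)
track_metadata_tbl <- copy_to(spark_conn, track_metadata)

The call returns a remote table object (a tbl_spark) that you can use with dplyr verbs, while the data itself lives in Spark.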

Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x).
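Continuing the sketch, listing Spark's tables is a one-liner:

# Returns a character vector of the table names known to Spark
src_tbls(spark_conn)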

Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand-track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand-track dataset, this is the same thing as a song (though the full million-row dataset suffered from some duplicate songs).

This exercise is part of the course Introduction to Spark with sparklyr in R.

Exercise instructions

track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.

  • Use str() to explore the track_metadata dataset.
  • Connect to your local Spark cluster, storing the connection in spark_conn.
  • Copy track_metadata to the Spark cluster using copy_to().
  • See which data frames are available in Spark, using src_tbls().
  • Disconnect from Spark.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load dplyr
___

# Explore track_metadata structure
___

# Connect to your Spark cluster
spark_conn <- spark_connect("___")

# Copy track_metadata to Spark
track_metadata_tbl <- ___(___, ___, overwrite = TRUE)

# List the data frames available in Spark
___(___)

# Disconnect from Spark
spark_disconnect(___)
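
If you want to check your work, here is one possible completion of the scaffold. It assumes a local Spark master and that track_metadata exists in your workspace; sparklyr may already be loaded in the exercise environment, but it is loaded explicitly here so the sketch runs on its own:

# Load sparklyr and dplyr
library(sparklyr)
library(dplyr)

# Explore track_metadata structure
str(track_metadata)

# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")

# Copy track_metadata to Spark
track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)

# List the data frames available in Spark
src_tbls(spark_conn)

# Disconnect from Spark
spark_disconnect(spark_conn)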