Copying data into Spark
Before you can do any real work with Spark, you need to get your data into it. sparklyr provides functions such as spark_read_csv() that read a CSV file directly into Spark. More generally, it is useful to be able to copy data from R to Spark; this is done with dplyr's copy_to() function. Be warned: copying data is a fundamentally slow process. In fact, much of the strategy for optimizing performance with big datasets comes down to finding ways to avoid copying data from one location to another.
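As a minimal sketch of the file-reading route, assuming sparklyr is installed with a local Spark instance available (the file name tracks.csv is hypothetical):

```r
# Sketch: read a CSV file straight into Spark, bypassing R's memory.
# Assumes a local Spark installation; "tracks.csv" is a hypothetical file.
library(sparklyr)

spark_conn <- spark_connect(master = "local")

# spark_read_csv() registers the file's contents as a Spark table
tracks_tbl <- spark_read_csv(spark_conn, name = "tracks", path = "tracks.csv")

spark_disconnect(spark_conn)
```

Because the data goes from disk to Spark directly, this avoids the slow R-to-Spark copy that copy_to() performs.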
copy_to() takes two required arguments: a Spark connection (dest) and a data frame (df) to copy over to Spark.
Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x).
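The copy-then-check workflow can be sketched as follows, assuming a local Spark instance and using R's built-in mtcars data frame as a stand-in dataset:

```r
# Sketch: copy an R data frame into Spark, then confirm it arrived.
# Assumes sparklyr is installed with a local Spark instance.
library(sparklyr)
library(dplyr)

spark_conn <- spark_connect(master = "local")

# copy_to(dest, df): dest is the Spark connection, df the local data frame.
# By default the Spark table takes the name of the data frame ("mtcars").
mtcars_tbl <- copy_to(spark_conn, mtcars)

# src_tbls() lists every data frame stored in Spark for this connection
src_tbls(spark_conn)

spark_disconnect(spark_conn)
```

Note that copy_to() returns a tbl object referencing the Spark table, not a local copy of the data.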
Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand-track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand-track dataset, this is the same thing as a song (though the full million-row dataset suffered from some duplicate songs).
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.
- Use str() to explore the track_metadata dataset.
- Connect to your local Spark cluster, storing the connection in spark_conn.
- Copy track_metadata to the Spark cluster using copy_to().
- See which data frames are available in Spark, using src_tbls().
- Disconnect from Spark.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load dplyr
___
# Explore track_metadata structure
___
# Connect to your Spark cluster
spark_conn <- spark_connect("___")
# Copy track_metadata to Spark
track_metadata_tbl <- ___(___, ___, overwrite = TRUE)
# List the data frames available in Spark
___(___)
# Disconnect from Spark
spark_disconnect(___)
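One plausible completion of the blanks above, assuming a local Spark cluster (master = "local") and that track_metadata is already defined in the workspace; this is a sketch, not the course's official solution:

```r
# Load dplyr
library(dplyr)

# Explore track_metadata structure
str(track_metadata)

# Connect to your Spark cluster (assumes a local installation)
spark_conn <- spark_connect(master = "local")

# Copy track_metadata to Spark; overwrite = TRUE replaces any existing table
track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)

# List the data frames available in Spark
src_tbls(spark_conn)

# Disconnect from Spark
spark_disconnect(spark_conn)
```

The overwrite = TRUE argument matters when rerunning the script: without it, copy_to() errors if a table with the same name already exists in Spark.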