Copying data into Spark
Before you can do any real work using Spark, you need to get your data into it. sparklyr has functions such as `spark_read_csv()` that read a CSV file directly into Spark. More generally, it is useful to be able to copy data from R to Spark. This is done with dplyr's `copy_to()` function. Be warned: copying data is a fundamentally slow process. In fact, much of the strategy for optimizing performance with big datasets comes down to finding ways to avoid copying data from one location to another.
`copy_to()` takes two arguments: a Spark connection (`dest`) and a data frame (`df`) to copy over to Spark.
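As a minimal sketch, assuming Spark is installed locally and using the built-in `mtcars` data frame purely for illustration, a copy might look like this:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster (assumes a local Spark installation)
spark_conn <- spark_connect(master = "local")

# dest is the Spark connection; df is the data frame to copy
mtcars_tbl <- copy_to(dest = spark_conn, df = mtcars)
```

Note that `copy_to()` returns a remote table object (a `tbl`), which you can use with dplyr verbs rather than pulling the data back into R.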
Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using `src_tbls()`, which simply takes a Spark connection argument (`x`).
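Assuming a connection object named `spark_conn` with at least one table already copied over, the check is a one-liner:

```r
# List the names of all data frames stored in the Spark cluster
src_tbls(spark_conn)
```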
Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand-track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand-track dataset, this is the same thing as a song (though the full million-row dataset suffered from some duplicate songs).
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
`track_metadata`, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.
- Use `str()` to explore the `track_metadata` dataset.
- Connect to your local Spark cluster, storing the connection in `spark_conn`.
- Copy `track_metadata` to the Spark cluster using `copy_to()`.
- See which data frames are available in Spark, using `src_tbls()`.
- Disconnect from Spark.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
```r
# Load dplyr
___

# Explore track_metadata structure
___

# Connect to your Spark cluster
spark_conn <- spark_connect("___")

# Copy track_metadata to Spark
track_metadata_tbl <- ___(___, ___, overwrite = TRUE)

# List the data frames available in Spark
___(___)

# Disconnect from Spark
spark_disconnect(___)
```
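For reference, one possible completed version of the sample code is sketched below. It assumes a local Spark installation and that `track_metadata` is pre-defined in your workspace, as described above:

```r
# Load dplyr (and sparklyr, which provides the Spark functions)
library(dplyr)
library(sparklyr)

# Explore track_metadata structure
str(track_metadata)

# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")

# Copy track_metadata to Spark; overwrite = TRUE replaces any existing copy
track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)

# List the data frames available in Spark
src_tbls(spark_conn)

# Disconnect from Spark
spark_disconnect(spark_conn)
```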