
Copying data into Spark

Before you can do any real work using Spark, you need to get your data into it. sparklyr has functions such as spark_read_csv() that read a CSV file directly into Spark. More generally, it is useful to be able to copy data from R to Spark. This is done with dplyr's copy_to() function. Be warned: copying data is a fundamentally slow process. In fact, much of the strategy for optimizing performance with big datasets involves finding ways to avoid copying the data from one location to another.

copy_to() takes two arguments: a Spark connection (dest), and a data frame (df) to copy over to Spark.

Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x).
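To make the two calls concrete, here is a minimal sketch of the copy-then-check workflow using the built-in iris data frame (an assumption for illustration; the exercise itself uses track_metadata). It assumes sparklyr, dplyr, and a local Spark installation are available.

```r
# Load the packages the workflow needs
library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
sc <- spark_connect(master = "local")

# Copy a data frame to Spark; returns a tbl that references the Spark table
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# List the data frames now stored in Spark; should include "iris"
src_tbls(sc)

# Always disconnect when you are done
spark_disconnect(sc)
```

Note that copy_to() returns a remote tbl, a lightweight R object that points at the data now living in Spark, rather than bringing the data back to R.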

Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand-track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand-track dataset, this is the same thing as a song (though the full million-row dataset suffered from some duplicate songs).

This exercise is part of the course

Introduction to Spark with sparklyr in R


Exercise instructions

track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.

  • Use str() to explore the track_metadata dataset.
  • Connect to your local Spark cluster, storing the connection in spark_conn.
  • Copy track_metadata to the Spark cluster using copy_to().
  • See which data frames are available in Spark, using src_tbls().
  • Disconnect from Spark.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Load dplyr
___

# Explore track_metadata structure
___

# Connect to your Spark cluster
spark_conn <- spark_connect("___")

# Copy track_metadata to Spark
track_metadata_tbl <- ___(___, ___, overwrite = TRUE)

# List the data frames available in Spark
___(___)

# Disconnect from Spark
spark_disconnect(___)