
Copying data into Spark

Before you can do any real work using Spark, you need to get your data into it. sparklyr has functions such as spark_read_csv() that read a CSV file directly into Spark. More generally, it is useful to be able to copy data from R to Spark. This is done with dplyr's copy_to() function. Be warned: copying data is a fundamentally slow process. In fact, much of the strategy for optimizing performance with big datasets involves finding ways to avoid copying the data from one location to another.

copy_to() takes two arguments: a Spark connection (dest), and a data frame (df) to copy over to Spark.

Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x).
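To make the two calls concrete, here is a minimal sketch of the copy-then-check workflow using the built-in iris data frame (an assumption for illustration; the exercise itself uses track_metadata). It assumes sparklyr, dplyr, and a local Spark installation are available.

```r
# Load the packages the workflow needs
library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
sc <- spark_connect(master = "local")

# Copy a data frame to Spark; returns a tbl that references the Spark table
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# List the data frames now stored in Spark; should include "iris"
src_tbls(sc)

# Always disconnect when you are done
spark_disconnect(sc)
```

Note that copy_to() returns a remote tbl, a lightweight R object that points at the data now living in Spark, rather than bringing the data back to R.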

Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand-track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand-track dataset, this is the same thing as a song (though the full million-row dataset suffered from some duplicate songs).

This exercise is part of the course

Introduction to Spark with sparklyr in R


Exercise instructions

track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.

  • Use str() to explore the track_metadata dataset.
  • Connect to your local Spark cluster, storing the connection in spark_conn.
  • Copy track_metadata to the Spark cluster using copy_to().
  • See which data frames are available in Spark, using src_tbls().
  • Disconnect from Spark.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Load dplyr
___

# Explore track_metadata structure
___

# Connect to your Spark cluster
spark_conn <- spark_connect("___")

# Copy track_metadata to Spark
track_metadata_tbl <- ___(___, ___, overwrite = TRUE)

# List the data frames available in Spark
___(___)

# Disconnect from Spark
spark_disconnect(___)