Introduction to Spark with sparklyr in R


Exercise

Copying data into Spark

Before you can do any real work using Spark, you need to get your data into it. sparklyr has some functions, such as spark_read_csv(), that will read a CSV file into Spark. More generally, it is useful to be able to copy data from R to Spark. This is done with dplyr's copy_to() function. Be warned: copying data is fundamentally slow. In fact, much of the strategy for optimizing performance with big datasets comes down to finding ways to avoid copying the data from one location to another.

copy_to() takes two arguments: a Spark connection (dest), and a data frame (df) to copy over to Spark.
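As a minimal sketch, assuming a Spark connection named spark_conn and an R data frame named track_metadata already exist in your workspace, the call looks like this:

```r
# Copy the R data frame into Spark; copy_to() returns a remote
# tibble that references the data now stored in Spark.
track_metadata_tbl <- copy_to(spark_conn, track_metadata)
```

Note that copy_to() returns a reference to the Spark table, which you can keep for later dplyr operations.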

Once you have copied your data into Spark, you might want some reassurance that it has actually worked. You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x).
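For example, assuming the same spark_conn connection from above, you can check that the copy succeeded like this:

```r
# List the names of all data frames currently stored in Spark
src_tbls(spark_conn)
```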

Throughout the course, you will explore track metadata from the Million Song Dataset. While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand track subset. To clarify the terminology: a track refers to a row in the dataset. For your thousand track dataset, this is the same thing as a song (though the full million row dataset suffered from some duplicate songs).

Instructions


track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.

  • Use str() to explore the track_metadata dataset.
  • Connect to your local Spark cluster, storing the connection in spark_conn.
  • Copy track_metadata to the Spark cluster using copy_to().
  • See which data frames are available in Spark, using src_tbls().
  • Disconnect from Spark.
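Put together, the steps above can be sketched as follows. This is a sketch, not the canonical solution: it assumes track_metadata is pre-defined in the workspace and that a local Spark installation is available.

```r
library(sparklyr)
library(dplyr)

# Explore the structure of the pre-defined track_metadata data frame
str(track_metadata)

# Connect to a local Spark cluster
spark_conn <- spark_connect(master = "local")

# Copy track_metadata into Spark
track_metadata_tbl <- copy_to(spark_conn, track_metadata)

# See which data frames are available in Spark
src_tbls(spark_conn)

# Disconnect from Spark when finished
spark_disconnect(spark_conn)
```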