Introduction to Spark with sparklyr in R

Exercise

Training/testing partitions

Most of the time, when you run a predictive model, you need to fit the model on one subset of your data (the "training" set), then test the model predictions against the rest of your data (the "testing" set).

sdf_partition() provides a way of partitioning your data frame into training and testing sets. Its usage is as follows.

a_tibble %>%
  sdf_partition(training = 0.7, testing = 0.3)
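Because the split is random, repeated runs will place different rows in each set. As a sketch (assuming a recent sparklyr version, where sdf_partition() accepts a seed argument), you can fix the seed to make the partition reproducible:

```r
library(sparklyr)

# Fix the random seed so the 70/30 split is the same on every run.
# a_tibble stands in for any Spark tibble, as in the examples above.
a_tibble %>%
  sdf_partition(training = 0.7, testing = 0.3, seed = 1234)
```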

There are two things to note about the usage. Firstly, if the partition weights don't add up to one, they are scaled so that they do. So if you passed training = 0.35 and testing = 0.15, each weight would be doubled, giving you a 70/30 split. Secondly, you can use any set names that you like, and partition the data into more than two sets. So the following is also valid.

a_tibble %>%
  sdf_partition(a = 0.1, b = 0.2, c = 0.3, d = 0.4)

The return value is a list of tibbles, one per named set. You can access each one using the usual list indexing operators.

partitioned$a
partitioned[["b"]]
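Putting the two ideas together, here is a sketch using the four-way split from above (a_tibble is a placeholder for any Spark tibble):

```r
library(sparklyr)

# Partition into four named sets; the weights 0.1 + 0.2 + 0.3 + 0.4 sum to 1
partitioned <- a_tibble %>%
  sdf_partition(a = 0.1, b = 0.2, c = 0.3, d = 0.4)

# The result is a named list of tibbles, so the names match the sets
names(partitioned)

# Dollar-sign and double-bracket indexing both retrieve a set
partitioned$a
partitioned[["b"]]
```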

Instructions


A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Use sdf_partition() to split the track metadata.
    • Put 70% in a set named training.
    • Put 30% in a set named testing.
  • Get the dim()ensions of the training tibble.
  • Get the dimensions of the testing tibble.
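The steps above can be sketched as follows (spark_conn and track_metadata_tbl are pre-defined in the exercise environment, as stated above):

```r
library(sparklyr)

# Split the track metadata: 70% into "training", 30% into "testing"
partitioned <- track_metadata_tbl %>%
  sdf_partition(training = 0.7, testing = 0.3)

# Dimensions (rows, columns) of each partition
dim(partitioned$training)
dim(partitioned$testing)
```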