Training/testing partitions
When you build a predictive model, you usually need to fit it on one subset of your data (the "training" set), then test its predictions against the rest of your data (the "testing" set).
sdf_random_split() provides a way of partitioning your data frame into training and testing sets. Its usage is as follows.
a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3)
There are two things to note about the usage. Firstly, if the partition weights don't add up to one, they will be scaled so that they do. So if you passed training = 0.35 and testing = 0.15, you'd get a 70%/30% split, double what you asked for. Secondly, you can use any set names that you like, and you can partition the data into more than two sets, so the following is also valid.
a_tibble %>%
  sdf_random_split(a = 0.1, b = 0.2, c = 0.3, d = 0.4)
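The split is random, so the exact rows in each set will differ from run to run. If you need a reproducible split, sdf_random_split() also accepts a seed argument (the value below is arbitrary).
a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3, seed = 42)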
The return value is a list of tibbles; you can access each one using the usual list indexing operators.
partitioned$a
partitioned[["b"]]
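To see the whole flow in one place, here is a minimal self-contained sketch. It assumes a local Spark installation and uses the built-in mtcars dataset; neither is part of this exercise.
# Load sparklyr and dplyr (for the pipe)
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy a small dataset into Spark
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Partition into named sets; the seed makes the split reproducible
partitioned <- mtcars_tbl %>%
  sdf_random_split(training = 0.7, testing = 0.3, seed = 1234)

# Access each set with the usual list operators
sdf_nrow(partitioned$training)
sdf_nrow(partitioned[["testing"]])

spark_disconnect(sc)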
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Use sdf_random_split() to split the track metadata.
  - Put 70% in a set named training.
  - Put 30% in a set named testing.
- Get the dimensions of the training tibble with sdf_dim().
- Get the dimensions of the testing tibble.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
partitioned <- track_metadata_tbl %>%
# Partition into training and testing sets
___
# Get the dimensions of the training set
___
# Get the dimensions of the testing set
___
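For reference, here is one possible completion of the blanks. This is a sketch: it assumes spark_conn and track_metadata_tbl are pre-defined as described above.
partitioned <- track_metadata_tbl %>%
  # Partition into training and testing sets
  sdf_random_split(training = 0.7, testing = 0.3)

# Get the dimensions of the training set
sdf_dim(partitioned$training)

# Get the dimensions of the testing set
sdf_dim(partitioned$testing)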