Training/testing partitions
When you build a predictive model, you usually need to fit it on one subset of your data (the "training" set), then test its predictions against the rest of your data (the "testing" set).
sdf_random_split() provides a way of partitioning your data frame into training and testing sets. Its usage is as follows.
a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3)
There are two things to note about the usage. Firstly, if the partition weights don't add up to one, they will be scaled so that they do. So if you passed training = 0.35 and testing = 0.15, you'd get a 70%/30% split, double what you asked for. Secondly, you can use any set names that you like, and you can partition the data into more than two sets, so the following is also valid.
a_tibble %>%
  sdf_random_split(a = 0.1, b = 0.2, c = 0.3, d = 0.4)
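The split is random, so the exact rows in each set will differ from run to run. If you need a reproducible split, sdf_random_split() also accepts a seed argument (the value below is arbitrary).
a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3, seed = 42)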
The return value is a list of tibbles; you can access each one using the usual list indexing operators.
partitioned$a
partitioned[["b"]]
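To see the whole flow in one place, here is a minimal self-contained sketch. It assumes a local Spark installation and uses the built-in mtcars dataset; neither is part of this exercise.
# Load sparklyr and dplyr (for the pipe)
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy a small dataset into Spark
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Partition into named sets; the seed makes the split reproducible
partitioned <- mtcars_tbl %>%
  sdf_random_split(training = 0.7, testing = 0.3, seed = 1234)

# Access each set with the usual list operators
sdf_nrow(partitioned$training)
sdf_nrow(partitioned[["testing"]])

spark_disconnect(sc)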
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Use sdf_random_split() to split the track metadata.
  - Put 70% in a set named training.
  - Put 30% in a set named testing.
- Get the dimensions of the training tibble with sdf_dim().
- Get the dimensions of the testing tibble.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
partitioned <- track_metadata_tbl %>%
# Partition into training and testing sets
___
# Get the dimensions of the training set
___
# Get the dimensions of the testing set
___
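For reference, here is one possible completion of the blanks. This is a sketch: it assumes spark_conn and track_metadata_tbl are pre-defined as described above.
partitioned <- track_metadata_tbl %>%
  # Partition into training and testing sets
  sdf_random_split(training = 0.7, testing = 0.3)

# Get the dimensions of the training set
sdf_dim(partitioned$training)

# Get the dimensions of the testing set
sdf_dim(partitioned$testing)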