
Training/testing partitions

Most of the time, when you run a predictive model, you need to fit the model on one subset of your data (the "training" set), then test the model predictions against the rest of your data (the "testing" set).

sdf_random_split() provides a way of partitioning your data frame into training and testing sets. Its usage is as follows.

a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3)

There are two things to note about the usage. Firstly, if the partition weights don't add up to one, they are scaled so that they do. For example, if you passed training = 0.35 and testing = 0.15, the weights sum to 0.5, so each would be doubled, giving the same 70/30 split as above. Secondly, you can use any set names that you like, and you can partition the data into more than two sets. So the following is also valid.

partitioned <- a_tibble %>%
  sdf_random_split(a = 0.1, b = 0.2, c = 0.3, d = 0.4)
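To make the scaling rule concrete, the two calls below request the same expected proportions, since 0.35 and 0.15 sum to 0.5 and each weight is doubled. (A sketch; the rows actually sampled still vary between runs unless you also pass a seed.)

# Equivalent requests: weights are normalised to sum to one
a_tibble %>%
  sdf_random_split(training = 0.7, testing = 0.3)

a_tibble %>%
  sdf_random_split(training = 0.35, testing = 0.15)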

The return value is a named list of tibbles. You can access each one using the usual list indexing operators.

partitioned$a
partitioned[["b"]]

This exercise is part of the course Introduction to Spark with sparklyr in R.


Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Use sdf_random_split() to split the track metadata.
    • Put 70% in a set named training.
    • Put 30% in a set named testing.
  • Use sdf_dim() to get the dimensions of the training tibble.
  • Get the dimensions of the testing tibble.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

partitioned <- track_metadata_tbl %>%
  # Partition into training and testing sets
  ___

# Get the dimensions of the training set
___

# Get the dimensions of the testing set
___
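One way to complete the blanks, as a sketch (the course's official solution may differ slightly):

partitioned <- track_metadata_tbl %>%
  # Partition into training and testing sets
  sdf_random_split(training = 0.7, testing = 0.3)

# Get the dimensions of the training set
sdf_dim(partitioned$training)

# Get the dimensions of the testing set
sdf_dim(partitioned$testing)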