Partitioning data with a group effect
Before you can run any models, you need to partition your data into training and testing sets. There's a complication with this dataset, which means you can't just call sdf_random_split(). The complication is that all tracks by a single artist ought to appear in the same set; your model will appear more accurate than it really is if some tracks by an artist are used to train the model and other tracks by the same artist then appear in the testing set.
The trick to dealing with this is to partition only the artist IDs, then inner join those partitioned IDs to the original dataset. Note that artist_id is more reliable than artist_name for partitioning, since some artists use variations on their name between tracks. For example, Duke Ellington sometimes has an artist name of "Duke Ellington", but other times has an artist name of "Duke Ellington & His Orchestra", or one of several spelling variants.
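To see the idea without a Spark cluster, here is a minimal local sketch using plain dplyr. The tracks tibble and its values are made up for illustration, and runif() stands in for sdf_random_split(), which only works on Spark DataFrames. The point is only that the random assignment happens once per artist, not once per track.

```r
library(dplyr)

# Hypothetical toy data: three artists with several tracks each
tracks <- tibble(
  artist_id = c("A", "A", "B", "B", "B", "C"),
  title     = c("a1", "a2", "b1", "b2", "b3", "c1")
)

set.seed(42)

# Partition the distinct artist IDs, not the individual tracks
artist_ids <- tracks %>%
  distinct(artist_id) %>%
  mutate(is_training = runif(n()) < 0.7)

# Inner join each partition of IDs back to the full track data
training <- tracks %>%
  inner_join(filter(artist_ids, is_training), by = "artist_id")
testing <- tracks %>%
  inner_join(filter(artist_ids, !is_training), by = "artist_id")

# Each artist has exactly one flag, so no artist can land in both sets
any(training$artist_id %in% testing$artist_id)
```

Because the split is done on artist IDs, the share of tracks in each set will only roughly match the 70/30 target: artists with many tracks pull the proportions around.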
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_tbl.
- Partition the artist IDs into training and testing sets, assigning the result to training_testing_artist_ids.
  - Select the artist_id column of track_data_tbl.
  - Get distinct rows.
  - Partition this into 70% training and 30% testing.
- Inner join the training dataset to track_data_tbl by artist_id, assigning the result to track_data_to_model_tbl.
- Inner join the testing dataset to track_data_tbl by artist_id, assigning the result to track_data_to_predict_tbl.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_data_tbl has been pre-defined
track_data_tbl

training_testing_artist_ids <- track_data_tbl %>%
  # Select the artist ID
  ___ %>%
  # Get distinct rows
  ___ %>%
  # Partition into training/testing sets
  ___

track_data_to_model_tbl <- track_data_tbl %>%
  # Inner join to training partition
  ___

track_data_to_predict_tbl <- track_data_tbl %>%
  # Inner join to testing partition
  ___
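
One possible way to fill in the blanks is sketched below. It assumes the exercise environment described above, with track_data_tbl already defined as a Spark tibble, and it is not necessarily the course's reference solution. sdf_random_split() returns a named list of Spark DataFrames, so the training and testing elements can be pulled out with $ and used directly in the joins.

```r
library(sparklyr)
library(dplyr)

# Assumes track_data_tbl is the pre-defined Spark tibble from the exercise
training_testing_artist_ids <- track_data_tbl %>%
  # Select the artist ID
  select(artist_id) %>%
  # Get distinct rows
  distinct() %>%
  # Partition into training/testing sets (70% / 30%)
  sdf_random_split(training = 0.7, testing = 0.3)

track_data_to_model_tbl <- track_data_tbl %>%
  # Inner join to training partition
  inner_join(training_testing_artist_ids$training, by = "artist_id")

track_data_to_predict_tbl <- track_data_tbl %>%
  # Inner join to testing partition
  inner_join(training_testing_artist_ids$testing, by = "artist_id")
```

The inner joins keep only the tracks whose artist_id appears in the corresponding partition, which is what puts every track by a given artist in the same set.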