Partitioning data with a group effect

Before you can run any models, you need to partition your data into training and testing sets. This dataset has a complication that means you can't just call sdf_random_split() on the tracks: all tracks by a given artist ought to appear in the same set. If some tracks by an artist are used to train the model and other tracks by the same artist appear in the testing set, your model will appear more accurate than it really is.

The trick to dealing with this is to partition only the artist IDs, then inner join those partitioned IDs to the original dataset. Note that artist_id is more reliable than artist_name for partitioning, since some artists use variations on their name between tracks. For example, Duke Ellington sometimes has an artist name of "Duke Ellington", but at other times has an artist name of "Duke Ellington & His Orchestra", or one of several spelling variants.

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_tbl.

Partition the artist IDs into training and testing sets, assigning the result to training_testing_artist_ids.
- Select the artist_id column of track_data_tbl.
- Get distinct rows.
- Partition this into 70% training and 30% testing.
Inner join the training dataset to track_data_tbl by artist_id, assigning the result to track_data_to_model_tbl.
Inner join the testing dataset to track_data_tbl by artist_id, assigning the result to track_data_to_predict_tbl.
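
The steps above can be sketched as follows. This is a minimal sketch rather than the definitive solution: it assumes track_data_tbl is already defined as described, that sparklyr and dplyr are installed, and that the list returned by sdf_random_split() is named by the training and testing arguments passed to it. The final n_shared_artists check is an illustrative addition, not part of the exercise.

```r
library(sparklyr)
library(dplyr)

# Assumes track_data_tbl is the pre-defined Spark tibble from the exercise.

# Partition the distinct artist IDs, not the individual tracks
training_testing_artist_ids <- track_data_tbl %>%
  select(artist_id) %>%
  distinct() %>%
  sdf_random_split(training = 0.7, testing = 0.3)

# Keep only the tracks whose artist landed in the training partition
track_data_to_model_tbl <- track_data_tbl %>%
  inner_join(training_testing_artist_ids$training, by = "artist_id")

# Keep only the tracks whose artist landed in the testing partition
track_data_to_predict_tbl <- track_data_tbl %>%
  inner_join(training_testing_artist_ids$testing, by = "artist_id")

# Illustrative sanity check (not part of the exercise): count artists
# that appear in both partitions
n_shared_artists <- training_testing_artist_ids$training %>%
  inner_join(training_testing_artist_ids$testing, by = "artist_id") %>%
  sdf_nrow()
```

Because the split is applied to distinct artist IDs, each artist falls into exactly one partition, so n_shared_artists should be zero. Note that the 70/30 proportions apply to artists, so the share of tracks in each set will only be approximately 70/30.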
