Come together

The features to the models you are about to run are contained in the timbre dataset, but the response – the year – is contained in the track_metadata dataset. Before you run the model, you are going to have to join these two datasets together. In this case, there is a one to one matching of rows in the two datasets, so you need an inner join.

There is one more data cleaning task you need to do. The year column contains integers, but Spark modeling functions require real numbers. You need to convert the year column to numeric.

A Spark connection has been created for you as spark_conn. Tibbles attached to the track metadata and timbre data stored in Spark have been pre-defined as track_metadata_tbl and timbre_tbl respectively.

Inner join the track metadata to the timbre data by the track_id column.
Convert the year column to numeric.

Light My Fire: Starting To Use Spark With dplyr Syntax

Tools of the Trade: Advanced dplyr Usage

Going Native: Use The Native Interface to Manipulate Spark DataFrames

Case Study: Learning to be a Machine: Running Machine Learning Models on Spark

Exercise

Come together

Instructions