Come together
The features for the models you are about to run are contained in the timbre
dataset, but the response (the year) is contained in the track_metadata
dataset. Before you run the model, you need to join these two datasets. In this case, the rows of the two datasets match one to one, so an inner join is appropriate.
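As a minimal sketch of what an inner join does, the example below uses two small local tibbles in place of the Spark-backed datasets. The data values and the timbre_means1 column name are hypothetical and chosen only for illustration; the track_id key and the year column come from the exercise.

```r
library(dplyr)

# Hypothetical stand-ins for the two datasets (not the real course data)
track_metadata <- tibble(
  track_id = c("TR1", "TR2", "TR3"),
  year     = c(1969L, 1984L, 2001L)
)
timbre <- tibble(
  track_id      = c("TR1", "TR2", "TR3"),
  timbre_means1 = c(40.1, 35.7, 48.2)  # hypothetical feature column
)

# An inner join keeps only rows whose track_id appears in both tables
joined <- track_metadata %>%
  inner_join(timbre, by = "track_id")

# With one-to-one matching, the join neither drops nor duplicates rows
nrow(joined) == nrow(track_metadata)
```

Because every track_id here appears exactly once in each table, the joined result has the same number of rows as either input, with the response and the feature side by side.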
There is one more data cleaning task you need to do. The year
column contains integers, but Spark modeling functions require real numbers. You need to convert the year column to numeric.
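The type conversion can be sketched on a local tibble as follows. The data here is hypothetical, and the example runs in plain R rather than on Spark. On a Spark tibble the same mutate() call should be translated by sparklyr into a cast to a double column, but treat that as an assumption to check against your own connection.

```r
library(dplyr)

# Hypothetical example data; year is stored as an integer
tracks <- tibble(
  track_id = c("TR1", "TR2"),
  year     = c(1969L, 1984L)
)

class(tracks$year)     # "integer"

# Overwrite year with a numeric (double) version of itself
converted <- tracks %>%
  mutate(year = as.numeric(year))

class(converted$year)  # "numeric"
```

The values are unchanged; only the storage type differs, which is what the modeling functions care about.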
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. Tibbles attached to the track metadata and timbre data stored in Spark have been pre-defined as track_metadata_tbl and timbre_tbl, respectively.
- Inner join the track metadata to the timbre data by the track_id column.
- Convert the year column to numeric.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl, timbre_tbl pre-defined
track_metadata_tbl
timbre_tbl
track_metadata_tbl %>%
  # Inner join to timbre_tbl
  ___ %>%
  # Convert year to numeric
  ___