Using Dask to train a linear model
Dask can be used to train machine learning models on datasets that are too big to fit in memory, and allows you to distribute the data loading, preprocessing, and training across multiple threads, processes, and even across multiple computers.
You have been tasked with training a machine learning model which will predict the popularity of songs in the Spotify dataset you used in previous chapters. The data has already been loaded as lazy Dask DataFrames. The input variables are available as dask_X and contain a few numeric columns, such as the song's tempo and danceability. The target values are available as dask_y and are the popularity score of each song.
This exercise is part of the course Parallel Programming with Dask in Python.
Exercise instructions
- Import the SGDRegressor class from sklearn.linear_model and the Incremental class from dask_ml.wrappers.
- Create an SGDRegressor linear regression model.
- Use the Incremental class to wrap the model so that it can be trained with a Dask dataset, and set the scoring parameter to 'neg_mean_squared_error'.
- Fit the wrapped model using only one loop through the data.
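To make "one loop through the data" concrete, here is a minimal sketch of the idea behind incremental training: the model sees the data one chunk at a time and updates its weights via partial_fit, so the full dataset never has to be in memory at once. It uses only NumPy and scikit-learn. The synthetic data, the chunk count, and the variable names are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stand-in data (an assumption; the exercise uses the Spotify dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

model = SGDRegressor(random_state=0)

# One pass over the data: every chunk is visited exactly once,
# and each call to partial_fit updates the same model in place
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    model.partial_fit(X_chunk, y_chunk)

print(model.coef_)
```

The Incremental wrapper applies the same pattern to the blocks of a Dask collection, which is why it needs an estimator that supports partial_fit, such as SGDRegressor.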
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the SGDRegressor and the Incremental wrapper
from ____ import ____
from ____ import ____
# Create an SGDRegressor model
model = ____
# Wrap the model so that it works with Dask
dask_model = ____
# Fit the wrapped model
dask_model.____