Using Dask to train a linear model

Dask can train machine learning models on datasets that are too large to fit in memory. It also lets you distribute data loading, preprocessing, and training across multiple threads, processes, or even multiple computers.

You have been tasked with training a machine learning model that predicts the popularity of songs in the Spotify dataset you used in previous chapters. The data has already been loaded as lazy Dask DataFrames. The input variables are available as dask_X and contain a few numeric columns, such as the song's tempo and danceability. The target values are available as dask_y and hold the popularity score of each song.

Import the SGDRegressor class from sklearn.linear_model and the Incremental class from dask_ml.wrappers.
Create an SGDRegressor linear regression model.
Use the Incremental class to wrap the model so that it can be trained with a Dask dataset, and set the scoring parameter to 'neg_mean_squared_error'.
Fit the wrapped model with a single pass through the data.
