1. Training machine learning models on big datasets
In data analytics, we often face datasets that are too large to fit in memory, or calculations that take too long to compute. The same issues arise in machine learning.
2. Dask-ML
Dask-ML is a library which allows us to use the power of Dask to speed up machine learning workflows using multiple threads, processes, and even multiple computers.
3. Linear regression
Let's say we have this dataset. We want to fit a model to it so that we can predict the target variable y from the input variable x. After we have fit the model, we will be able to make predictions for new values of x in the future.
4. Linear regression
One way to model the data is using linear regression. In linear regression, we fit a straight line to the data.
5. Linear regression
We choose a straight line that minimizes the distances between the fitted line and the data points - these distances are called residuals. Usually, we minimize the mean of the squared residuals because it has some nice statistical properties.
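Written out, the quantity being minimized is the mean squared error, where y_i is the i-th observed value, ŷ_i is the line's prediction for that point, and n is the number of data points:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2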
6. Fitting a linear regression model
If we are using a dataset which fits in memory, we can fit a linear model using scikit-learn. We import the SGDRegressor model class and create an instance of the class. We then fit the model to the data using the model's fit method. We pass it the input variables X and the target variable y. Once it has been fit to the data, we can use the model to make predictions.
In this case, X and y can be NumPy arrays or pandas DataFrames. This means they need to be small enough to fit into memory.
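As a minimal sketch, this is what the in-memory workflow might look like; the dataset here is synthetic and the variable names are purely illustrative:

import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative in-memory dataset: y is roughly 2*x plus some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 1, size=100)

model = SGDRegressor()
model.fit(X, y)                      # loops over the full dataset many times
predictions = model.predict(X[:5])   # predictions for some input values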
7. Using a scikit-learn model with Dask
But what if the data is too big? What if we want to fit the model to Dask DataFrames or arrays? This is precisely what Dask-ML is for.
Dask-ML allows us to fit scikit-learn models to lazy datasets.
To use this scikit-learn model with Dask, we need to import the Incremental class from Dask-ML-dot-wrappers.
We create an Incremental object and pass the scikit-learn model into it along with the scoring function. The scoring function converts the residuals between the data points and the line into a single number which tells us how well the model fits. Here, the model will maximize the negative of the mean squared error. Maximizing this is the same as minimizing the mean squared error.
Once we have created this object, we can fit the model to the Dask dataset. This fitting is performed immediately, not lazily.
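Here is a minimal sketch of that pattern; the chunked Dask arrays X and y are synthetic and built purely for illustration:

import dask.array as da
import numpy as np
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDRegressor

# Illustrative lazy dataset: chunked Dask arrays standing in for a big dataset
rng = np.random.default_rng(0)
X = da.from_array(rng.uniform(0, 10, size=(10_000, 1)), chunks=(1_000, 1))
y = 2 * X.ravel() + da.random.normal(0, 1, size=10_000, chunks=1_000)

# Wrap the scikit-learn estimator so it can be trained chunk by chunk
model = Incremental(SGDRegressor(), scoring='neg_mean_squared_error')
model.fit(X, y)   # runs immediately, making a single pass over every chunk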
8. Fitting takes multiple iterations
The model is fit by repeatedly making small adjustments to the position of the line. We loop through the data points and make an adjustment to make the model's prediction closer to the real value. When we use the scikit-learn model's fit method, it will loop through the full dataset many times. If we use the Dask model's fit method, it only loops through the dataset once. Calling the fit method multiple times won't work, so if we want to loop through the dataset many times, we must use the partial_fit method.
9. Training an Incremental model
We can loop through the data many times by using the partial_fit method in a for-loop.
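Continuing the same sketch, the loop might look like this; ten passes is an arbitrary choice for illustration:

# Each call to partial_fit makes one more pass over the lazy dataset
for _ in range(10):
    model.partial_fit(X, y)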
10. Generating predictions
Once we have trained the model, we can make predictions using the Incremental object's dot-predict method. This is lazy, and it returns a Dask array. As usual, we can use the compute method to return the values.
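Continuing the sketch:

# predict is lazy: it returns a Dask array rather than computed values
y_pred = model.predict(X)

# compute() triggers the work and returns the predictions as a NumPy array
values = y_pred.compute()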
11. Let's practice!
Okay, let's use Dask to train a machine learning model.