1. Machine learning with big datasets
An important part of creating models which fit our data well is preprocessing and preparing that data. Dask-ML has some specialized functions which help us with this preprocessing.
2. Loading and preprocessing data
Let's say we want to prepare some tabular data for modeling. First, we load the data as a lazy Dask DataFrame and split it into the input variables X and the target variable y. We do this by selecting columns from the DataFrame.
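Here is a minimal sketch of that step; the file path and column names are placeholders, not part of the lesson's dataset.

```python
import dask.dataframe as dd

# Lazily load the tabular data (hypothetical file pattern)
df = dd.read_csv("measurements_*.csv")

# Select the input variables X and the target variable y by column
# (hypothetical column names)
X = df[["feature_1", "feature_2", "feature_3"]]
y = df["target"]
```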
Before training a model, it is common to apply transforms to the input variables, as this helps improve model performance. Many of the common preprocessing methods are available in Dask-ML.
Here, we import the StandardScaler class from Dask-ML's preprocessing subpackage. This standard scaler transform will scale each column so that it has a mean of zero and a variance of one.
To use the standard scaler, we create an instance of the class and fit it to the dataset. The fit method is executed right away, not lazily; the means and variances needed to standardize the dataset are actually computed at this line.
Once we have fit the scaler, we can use it to transform the data. Unlike fitting, transforming the DataFrame is performed lazily.
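Putting those three steps together, a sketch might look like this, assuming X is the Dask DataFrame of input variables from the earlier example.

```python
from dask_ml.preprocessing import StandardScaler

# Create the scaler and fit it to the input variables;
# fit() computes the per-column means and variances immediately (not lazily)
scaler = StandardScaler()
scaler.fit(X)

# transform() is lazy: it returns a new Dask DataFrame whose columns
# will have mean 0 and variance 1 once computed
X_scaled = scaler.transform(X)
```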
3. Train-test split
Splitting our data into train and test sets is really important. We train the model on the training set, and the test set allows us to check the model's performance on data it hasn't seen before. But how do we split the data when it is lazily loaded inside a Dask DataFrame? Thankfully, Dask-ML has a function that will split the DataFrame appropriately. We pass it the input variables X and the target values y, tell it to shuffle each chunk of data, and set the size of the test set. Here, we set test_size to 0.2, which means we keep 20 percent of the data as a test set.
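A minimal sketch of this split, using the scaled inputs from the preprocessing step above:

```python
from dask_ml.model_selection import train_test_split

# Split into train and test sets; shuffle each chunk of data and
# keep 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, shuffle=True, test_size=0.2
)
```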
4. Scoring
After we have trained the model, as we did in the last lesson, we can measure its performance. The score method takes the input variables, X_train, and uses them to make predictions. It then compares these predictions with y_train to calculate the residuals. It converts these residuals into a single score using the scoring method we set when we created the model. The score shown here is the negative mean squared error.
We also print the score for the test set. The score for the test set is a little worse than for the training set. This is to be expected since it is data our model wasn't fit to. The test score tells you how well you would expect your model to perform when making predictions on more unseen data.
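As a sketch, scoring both sets might look like this, assuming `model` is the estimator trained in the previous lesson with its scoring method set to negative mean squared error.

```python
# Score on the training set (data the model was fit to)
print(model.score(X_train, y_train))

# Score on the test set (unseen data); expect this to be a little worse
print(model.score(X_test, y_test))
```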
5. Let's practice!
In these final few exercises, you'll preprocess some data so that you can improve your model's performance. Let's practice.