Cross-validation with shuffling

As you'll recall, cross-validation is the process of splitting your data into training and test sets multiple times, choosing a different training and test set on each split. In this exercise, you'll perform a traditional ShuffleSplit cross-validation on the company value data from earlier: the same historical price data for several large companies. ShuffleSplit shuffles the samples before splitting, so it ignores their order in time. Later we'll cover what changes need to be made for time series data.
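To see what ShuffleSplit actually produces, here is a minimal sketch on a toy array. The array X_demo, the choice of 3 splits, and test_size=0.2 are illustrative assumptions, not part of the exercise. Each split is a pair of index arrays, drawn at random without regard to sample order.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data (hypothetical): 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)

# 3 random splits, each holding out 20% of the samples for testing
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=1)

# split() yields (train_indices, test_indices) pairs
splits = list(cv_demo.split(X_demo))
for i, (tr, tt) in enumerate(splits):
    print(f"split {i}: train={tr}, test={tt}")
```

Within one split the training and test indices never overlap, but the same sample can land in the test set of more than one split.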

A linear regression model instance (model) is available in your workspace, along with the function r2_score() for scoring. The data is stored in the arrays X and y. We've also provided a helper function, visualize_predictions(), to plot the results.
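As a quick reminder of how scoring works, the sketch below assumes r2_score() is the function from sklearn.metrics, which takes the true values first and the predictions second. The arrays y_true_demo, perfect, and mean_only are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and two sets of predictions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true_demo.copy()                       # exact predictions
mean_only = np.full_like(y_true_demo, y_true_demo.mean())  # always predict the mean

# Argument order is (y_true, y_pred)
score_perfect = r2_score(y_true_demo, perfect)
score_mean = r2_score(y_true_demo, mean_only)
print(score_perfect, score_mean)
```

A perfect prediction scores 1, while always predicting the mean of the true values scores 0. The score is not symmetric in its arguments, so swapping them can change the result.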

This exercise is part of the course

Machine Learning for Time Series Data in Python


Exercise instructions

Initialize a ShuffleSplit cross-validation object with 10 splits.
Iterate through CV splits using this object. On each iteration:
- Fit a model using the training indices.
- Generate predictions using the test indices, score the model (\(R^2\)) using the predictions, and collect the results.
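The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the course's X, y, and model are not reproduced here, so the names X_syn, y_syn, and model_syn are hypothetical stand-ins, and scikit-learn's LinearRegression is assumed as the model type. The course helper visualize_predictions() is left out because its implementation is not shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in data (assumption): linear signal plus small noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model_syn = LinearRegression()
cv_syn = ShuffleSplit(n_splits=10, random_state=1)

results_syn = []
for tr, tt in cv_syn.split(X_syn, y_syn):
    # Fit on the training indices only
    model_syn.fit(X_syn[tr], y_syn[tr])

    # Predict on the held-out indices and score against the true values
    prediction = model_syn.predict(X_syn[tt])
    score = r2_score(y_syn[tt], prediction)
    results_syn.append((prediction, score, tt))

scores = [s for _, s, _ in results_syn]
print(f"mean R^2 over {len(scores)} splits: {np.mean(scores):.3f}")
```

Refitting the same model object on each iteration is fine here because fit() replaces the previously learned coefficients. Keeping the test indices alongside each prediction makes it possible to line predictions up with the original samples afterwards.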

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import ShuffleSplit and create the cross-validation object
from sklearn.model_selection import ShuffleSplit
cv = ____(____, random_state=1)

# Iterate through CV splits
results = []
for tr, tt in cv.____(X, y):
    # Fit the model on training data
    ____(X[tr], y[tr])
    
    # Generate predictions on the test data, score the predictions, and collect
    prediction = ____(X[tt])
    score = r2_score(____, ____)
    results.append((prediction, score, tt))

# Custom function to quickly visualize predictions
visualize_predictions(results)
