Standardizing data
Some models, like K-nearest neighbors (KNN) and neural networks, work better with scaled data, so we'll standardize our data. We'll also remove unimportant variables (the day-of-week features), according to the feature importances, by indexing the feature DataFrames with .iloc[]. KNN uses distances to find similar points for predictions, so features on large scales outweigh features on small scales. Scaling the data fixes that.
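To see why scale matters for KNN, here is a small illustrative sketch (the feature values are made up, not from the course data): with one feature in the hundreds and another near 1, Euclidean distance is dominated by the large feature, and standardizing can even change which point counts as the nearest neighbor.

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up features on very different scales:
# column 0 in the hundreds (e.g. a price), column 1 near unit scale (e.g. a ratio)
X = np.array([[100.0, 0.5],
              [110.0, 0.9],
              [300.0, 0.6]])

# Unscaled: distances from row 0 are driven almost entirely by column 0
d_unscaled = np.linalg.norm(X - X[0], axis=1)

# After standardizing each column to mean 0 and std 1, both features contribute
X_scaled = scale(X)
d_scaled = np.linalg.norm(X_scaled - X_scaled[0], axis=1)

print(d_unscaled)  # row 1 looks much closer than row 2
print(d_scaled)    # after scaling, the ordering can flip
```

Here the unscaled distances say row 1 is by far the closest to row 0, purely because of the price-like column; in the scaled space the two other rows are comparably close.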
sklearn's scale() will standardize data, setting the mean to 0 and the standard deviation to 1. Ideally we'd use StandardScaler, calling fit_transform() on the training data and transform() on the test data, but we are limited to 15 lines of code here.
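For reference, the StandardScaler workflow (fit on the training data only, then apply that same transform to the test data) looks like this sketch, with made-up placeholder arrays standing in for the course's features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder arrays standing in for train/test features (assumption)
train_features = np.array([[1.0, 200.0],
                           [2.0, 220.0],
                           [3.0, 180.0]])
test_features = np.array([[1.5, 210.0],
                          [2.5, 190.0]])

scaler = StandardScaler()
# Learn mean and std from the training data, and scale it in one step
scaled_train = scaler.fit_transform(train_features)
# Apply the *training* mean/std to the test data -- no refitting
scaled_test = scaler.transform(test_features)
```

Fitting the scaler only on the training data avoids leaking test-set statistics into the model.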
Once we've scaled the data, we'll check that it worked by plotting histograms of the data.
This exercise is part of the course Machine Learning for Finance in Python.
Exercise instructions
- Remove the day-of-week features from the train/test features using .iloc (the day-of-week features are the last 4 columns).
- Standardize train_features and test_features using sklearn's scale(); store the scaled features as scaled_train_features and scaled_test_features.
- Plot a histogram of the 14-day RSI moving average (indexed at [:, 2]) from the unscaled train_features on the first subplot (ax[0]).
- Plot a histogram of the standardized 14-day RSI moving average on the second subplot (ax[1]).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
# Remove unimportant features (weekdays)
train_features = train_features.iloc[:, :-4]
test_features = test_features.____
# Standardize the train and test features
scaled_train_features = scale(train_features)
scaled_test_features = ____
# Plot histograms of the 14-day RSI moving average before and after scaling
f, ax = plt.subplots(nrows=2, ncols=1)
train_features.iloc[:, 2].hist(ax=____)
ax[1].hist(scaled_train_features[:, 2])
plt.show()
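One way the completed exercise might look is sketched below. Since the course's train_features and test_features aren't available outside the exercise, this uses randomly generated stand-in DataFrames (the column names and the assumption that column 2 holds the 14-day RSI moving average are mine, not the course's):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Stand-in feature DataFrames (assumption: 7 columns, last 4 are day-of-week dummies)
rng = np.random.default_rng(0)
cols = ['ma14', 'rsi14', 'rsi14_ma', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4']
train_features = pd.DataFrame(rng.normal(size=(100, 7)), columns=cols)
test_features = pd.DataFrame(rng.normal(size=(30, 7)), columns=cols)

# Remove unimportant features (the last 4 day-of-week columns)
train_features = train_features.iloc[:, :-4]
test_features = test_features.iloc[:, :-4]

# Standardize the train and test features
scaled_train_features = scale(train_features)
scaled_test_features = scale(test_features)

# Plot histograms of the 14-day RSI moving average before and after scaling
f, ax = plt.subplots(nrows=2, ncols=1)
train_features.iloc[:, 2].hist(ax=ax[0])
ax[1].hist(scaled_train_features[:, 2])
plt.show()
```

The second histogram should have the same shape as the first but be centered at 0 with unit spread, which is the quick visual check that scaling worked.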