1. Scaling data and KNN Regression
In this chapter, we'll use k-nearest neighbors and neural networks to predict future stock values. Both methods typically work better with standardized data, and we'll learn how to standardize our data using Python and sklearn.
2. Remove unimportant features
Before we move on, let's review. We looked at feature importances from our tree-based models, and these showed that the weekday features are unimportant.
3. Feature selection: remove weekdays
Let's remove these unimportant features. This is part of "feature selection", which is an important step in machine learning. Remember that we created the weekday features last, so we can remove the four weekday variables by indexing up to, but not including, the last four columns.
4. Remove weekdays
Our feature variables are pandas DataFrames, so we'll use dot-iloc to index the features and remove the weekday columns.
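As a rough sketch, assuming train_features and test_features are the DataFrames built earlier with the four weekday dummies as their last columns, the indexing could look like this:

# Keep every row and every column except the last four (the weekday dummies);
# the DataFrame names are assumptions based on the earlier chapters.
train_features = train_features.iloc[:, :-4]
test_features = test_features.iloc[:, :-4]

print(train_features.columns)  # the weekday columns should no longer appear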
5. KNN introduction
Now let's look at how k-nearest neighbors, or KNN, works. Say we have data with 2-dimensional features that looks like this. The values of the target are shown on each point.
6. KNN predictions
If we have a new point, shown in blue, what is its target value? KNN takes the k-nearest points to this new point and averages their target values to get our prediction.
7. KNN predictions
We need to choose the number of neighbors as a hyperparameter, which we will say is 2 here. We take the average of the 2 nearest points and have our prediction of 5.
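As a small sketch of this idea (the points and target values are made up to mirror the slide), sklearn's KNeighborsRegressor averages the targets of the k nearest points:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 2D features and targets, chosen so the prediction comes out to 5
X = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 4.0], [5.0, 2.0]])
y = np.array([4.0, 6.0, 10.0, 12.0])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# The two nearest neighbors of the new point have targets 4 and 6,
# so the prediction is their average: 5.0
print(knn.predict([[2.0, 1.0]]))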
8. Minkowski distance equation
It was easy to see which points are the closest in the previous plot. But this won't work for computations -- we need to use math. KNN uses the Minkowski distance to measure the distance between points. This is the same as the Euclidean, or straight-line, distance when p=2.
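In symbols, the Minkowski distance between points x and y is (sum over i of |x_i - y_i|^p)^(1/p). A minimal sketch with made-up points shows that p=2 reproduces the familiar straight-line distance:

import numpy as np

def minkowski_distance(a, b, p=2):
    # Sum of absolute differences raised to the power p, then take the p-th root
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(minkowski_distance(a, b, p=2))  # 5.0 -- the Euclidean (straight-line) distance
print(minkowski_distance(a, b, p=1))  # 7.0 -- the Manhattan distance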
9. Large vs. small features
If one feature's range is small and another's is large, the larger feature outweighs the smaller in distance calculations, which can hurt our model's performance. It's usually best to normalize the data so features have similar ranges. For example, the range of y values shown here is much larger than the range of x values, so y outweighs x in the distance calculations. The graph on the right uses equal x- and y-axis ranges to highlight the difference in magnitude.
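To see this numerically (the values here are invented for illustration), compare the squared differences of a small x feature and a large y feature:

import numpy as np

point_a = np.array([0.2, 1500.0])  # [x, y]
point_b = np.array([0.9, 1800.0])

diffs = point_b - point_a
print(diffs ** 2)                   # [0.49, 90000.0] -- y dominates
print(np.sqrt(np.sum(diffs ** 2)))  # about 300, set almost entirely by y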
10. Scaling options
Among the options for scaling data are min-max scaling, standardization, median and median absolute deviation scaling, and mapping to functions like a sigmoid or hyperbolic tangent. In this case, we'll use standardization since it's easy to use and works well.
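For a rough comparison of two of these options (the tiny array is made up, and sklearn's MinMaxScaler and StandardScaler are used here just for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0], [10.0]])

print(MinMaxScaler().fit_transform(data).ravel())    # squeezed into the range [0, 1]
print(StandardScaler().fit_transform(data).ravel())  # mean 0, standard deviation 1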
11. Standardization effects on KNN
Our features have similar ranges after standardization, which means KNN won't be biased towards large features in the distance calculations. Sometimes this is helpful, and sometimes it's not: we may want to bias the KNN model towards features with high importance, which we found from our random forest.
12. sklearn's StandardScaler
To standardize our features, we can use the StandardScaler class from sklearn-dot-preprocessing. We first create an instance of the class, then use the dot-fit_transform method on our train_features. This fits the scaler to the training data and transforms it at the same time. Then we use dot-transform on test_features to transform the test data.
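A minimal sketch of that workflow, assuming the train_features and test_features DataFrames from earlier (the scaled_ variable names are just a choice for this example):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit the scaler to the training data and transform it in one step
scaled_train_features = sc.fit_transform(train_features)

# Transform the test data with the same fitted scaler (no refitting)
scaled_test_features = sc.transform(test_features)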
13. Standardization
Standardization subtracts the mean from all data points, then divides by the standard deviation. This sets the mean to 0 and the standard deviation to 1, and works best with Gaussian, or normal, distributions. Here we see the effect of standardization on the 14-day RSI moving average.
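In other words, each standardized value is z = (x - mean) / standard deviation. A quick check with made-up numbers:

import numpy as np

x = np.array([10.0, 12.0, 14.0, 20.0])

z = (x - x.mean()) / x.std()  # subtract the mean, divide by the standard deviation
print(z.mean(), z.std())      # approximately 0.0 and 1.0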
14. Making subplots
To make the previous plot, we create subplots with 2 rows and 1 column, which returns a figure and a list of axes. Our features are a pandas DataFrame, so we use dot-iloc to get all rows of the 3rd column, which is the 14-day RSI moving average, and plot it on the first axis. The scaled features are a numpy array, so we can index them with square brackets and plot using the second axis in the axes list.
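A sketch of that plotting code, assuming train_features and scaled_train_features from the previous step, and histograms as the plot type (an assumption; the slide doesn't name it):

import matplotlib.pyplot as plt

# 2 rows, 1 column of subplots; axes holds the two axis objects
f, axes = plt.subplots(nrows=2, ncols=1)

# Raw 14-day RSI moving average: a DataFrame column, so index with .iloc
axes[0].hist(train_features.iloc[:, 2])

# Scaled version: a numpy array, so index with square brackets
axes[1].hist(scaled_train_features[:, 2])

plt.show()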
15. Scale data and use KNN!
Ok, let's scale our features and see how KNN works.
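As a preview of the exercises, here is a minimal sketch of fitting KNN on the scaled data (the variable names and n_neighbors=5 are assumptions for illustration):

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(scaled_train_features, train_targets)

# R-squared on the train and test sets
print(knn.score(scaled_train_features, train_targets))
print(knn.score(scaled_test_features, test_targets))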