1. Data preparation for kNN
You've now seen the kNN algorithm in action while simulating aspects of a self-driving vehicle. You've gained an understanding of the impact of k on the algorithm's performance, and know how to examine the neighbors' votes to better understand which predictions are closer to unanimous.
But before applying kNN to your own projects, you'll need to know one more thing: how to prepare your data for nearest neighbors.
2. kNN assumes numeric data
As noted previously, nearest neighbor learners use distance functions to identify the most similar, or nearest, examples. Many common distance functions assume that your data are in numeric format, as it is difficult to define the distance between categories.
For example, there's no obvious way to define the distance between "red" and "yellow"; consequently, the traffic sign dataset represented these using numeric color intensities.
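To make this concrete, here is a toy sketch in R with invented RGB intensities; once colors are numeric, the distance between them is easy to compute:

# Hypothetical RGB intensities (0 to 255) for two sign backgrounds
red_sign    <- c(r = 204, g = 0,   b = 0)
yellow_sign <- c(r = 255, g = 204, b = 0)

# Euclidean distance between the two color vectors
sqrt(sum((red_sign - yellow_sign)^2))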
But suppose that you have a property that cannot be measured numerically, such as whether a road sign is a rectangle, diamond, or octagon. A common solution uses 1/0 indicators to represent these categories. This is called dummy coding.
A binary "dummy" variable is created for each category except one. This variable is set to '1' if the category applies and '0' otherwise. The category that is left out can be easily deduced: if a sign is not a rectangle or a diamond, then it must be an octagon.
Dummy coded data can be used directly in a distance function; two rectangle signs, both having values of '1', will be found to be closer together than a rectangle and a diamond.
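As a minimal sketch, assuming a hypothetical signs data frame with a categorical shape column, the dummy coding could be done like this:

# Hypothetical data frame with a categorical shape feature
signs <- data.frame(shape = c("rectangle", "diamond", "octagon", "rectangle"))

# One 1/0 dummy variable per category, leaving out octagon
signs$rectangle <- ifelse(signs$shape == "rectangle", 1, 0)
signs$diamond   <- ifelse(signs$shape == "diamond",   1, 0)

# A sign with rectangle == 0 and diamond == 0 must be an octagon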
3. kNN benefits from normalized data
It is also important to be aware that when calculating distance, each feature of the input data should be measured with the same range of values.
This was true for the traffic sign data; each color component ranged from a minimum of zero to a maximum of 255.
However, suppose that we added the 1/0 dummy variables for sign shapes into the distance calculation. Two different shapes may differ by at most one unit, but two different colors may differ by as much as 255 units!
Such a difference in scale allows features with a wider range to have more influence over the distance calculation, as this figure illustrates. Here, the topmost speed limit sign is closer to the pedestrian sign than it is to its correct neighbors, simply because the range of blue values is wider than the 0-to-1 range of shape values.
Compressing the blue axis so that it also follows a 0-to-1 range corrects this issue, and the speed limit sign is now closer to its true neighborhood.
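A toy calculation with invented values illustrates the effect; the blue intensities and shape dummies here are hypothetical:

# Hypothetical features: blue intensity (0 to 255) and rectangle dummy (0/1)
speed_limit <- c(blue = 80,  rectangle = 1)
speed_other <- c(blue = 210, rectangle = 1)   # same shape, different blue
pedestrian  <- c(blue = 90,  rectangle = 0)   # similar blue, different shape

# Unnormalized: blue dominates, so the pedestrian sign appears nearer
sqrt(sum((speed_limit - speed_other)^2))   # 130
sqrt(sum((speed_limit - pedestrian)^2))    # about 10.05

# With blue rescaled to 0-to-1 (dividing by 255), shape matters again
sqrt(sum((c(80/255, 1) - c(210/255, 1))^2))  # about 0.51
sqrt(sum((c(80/255, 1) - c(90/255, 0))^2))   # about 1.00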
4. Normalizing data in R
R does not have a built-in function to rescale data to a given range, so you'll need to create one yourself. The code here defines a function called normalize, which performs min-max normalization. This rescales a vector x such that its minimum value is zero and its maximum value is one. It does this by subtracting the minimum value from each value of x and dividing by the range of x values.
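Based on that description, the normalize function could be written as follows:

# Min-max normalization: rescale a numeric vector to the 0-to-1 range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}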
After applying this function to r1, one of the color vectors, we can use the summary function to see that the new minimum and maximum values are 0 and 1, respectively. Calculating the same summary statistics for the unnormalized data shows a minimum of 3 and a maximum of 251.
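In code, assuming the traffic sign data frame, signs, stores the red color component in a column named r1, the comparison might look like this, using the normalize function defined above:

# Normalized values now span 0 to 1
summary(normalize(signs$r1))

# The original, unnormalized values span 3 to 251
summary(signs$r1)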
5. Let's practice!
I hope you enjoyed your time simulating the training of an autonomous car. After a brief test of what you've learned about normalization, you'll get started on another interesting classification task.