1. Feature hashing
Sometimes, we create dummy variables using data that doesn't contain all possible values of a factor. This can cause problems when feeding unseen data into a model, since unobserved factor values can appear for which we don't have a dummy variable.
2. What is feature hashing?
Feature hashing, aka the hashing trick, is another dimensionality reduction technique used in machine learning to transform high-dimensional data into a lower-dimensional representation.
It applies a hash function to the input features, mapping each feature value to a specific index in a hash table.
These values are used as indices in a feature vector, effectively encoding the original features into a compact, fixed-length representation.
As illustrated by the table, we assign the index 30 to UA, 32 to WN, 27 to DL, etc.
We can use hashing in large-scale machine learning applications where memory and computation resources are limited, and in settings where new data can introduce previously unseen values of a predictor.
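To make the idea concrete, here is a minimal sketch of the trick in R. The digest2int() hash from the digest package is an assumption chosen for illustration; textrecipes uses its own hash internally, so these indices won't match the ones in the table.

    # Toy illustration of the hashing trick; digest2int() is an assumed
    # hash function, so the resulting indices differ from the slide's table
    library(digest)

    num_terms <- 32
    carriers  <- c("UA", "WN", "DL", "B6")

    # Hash each carrier code and fold it into one of num_terms buckets
    abs(digest2int(carriers)) %% num_terms + 1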
3. How many carriers are there?
One feature in the flights dataset is a vector of carriers, which could be an informative predictor to model late arrivals.
We only know the carriers that show up in our sample, so dummy variables can cause trouble if an unobserved carrier appears when we are trying to make predictions on new data, as we wouldn't have a dummy variable to represent this new carrier.
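As a quick check, we can count the distinct carriers in our sample; this sketch assumes the nycflights13::flights data.

    # How many distinct carriers appear in this sample?
    library(dplyr)
    library(nycflights13)  # assumed source of the flights data

    flights %>% distinct(carrier) %>% nrow()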
4. Let us hash that feature
We can overcome this problem by creating dummy hashes to index the data, allowing us to handle new values, since the hashing function will assign an index to them.
To do this, we add a step_dummy_hash to our recipe. The first argument is a set of columns with the features we are after. In our case, this is the carrier, which is all we need for the function to work. Next, we add a couple of arguments to make things more concise.
By default, the step generates column names with a long prefix followed by the column name and the index value. Setting the prefix argument to NULL, we get back only the column name and the index, making the output easier to read.
We set the signed argument to FALSE to obtain a binary encoding of 0 or 1. The default uses the values -1, 0, and 1 to represent the feature values, which is fine but can be confusing at first. So, let's stick to the more intuitive 0/1 encoding.
Finally, the num_terms parameter does precisely what it sounds like. The default is 32, and we can set it to any number, say fifty, that makes us comfortable that we are representing all factor values. Some applications use thousands or even millions for this value.
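Putting those arguments together, the recipe might look like this sketch; the arr_delay outcome and the flights data are assumptions based on the lesson's setup.

    library(tidymodels)
    library(textrecipes)  # provides step_dummy_hash()

    # Assumed formula: predicting arr_delay from carrier
    hash_rec <- recipe(arr_delay ~ carrier, data = flights) %>%
      step_dummy_hash(carrier,
                      prefix = NULL,   # shorter names: column name + index only
                      signed = FALSE,  # binary 0/1 instead of the default -1/0/1
                      num_terms = 50)  # number of hash columns; default is 32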
We prep and bake the recipe. Bake for a recipe is similar to predict for a model.
We are ready to take a peek at the representation. In this portion of the matrix, we can see that carrier B6, appearing in the second observation, is indexed as 18.
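A sketch of those two steps, continuing from the hash_rec recipe above:

    # Estimate the recipe on the training data, then apply it;
    # bake() is to a recipe what predict() is to a model
    hashed <- hash_rec %>%
      prep() %>%
      bake(new_data = NULL)

    # Peek at a few of the hashed columns for the first observations
    hashed %>% select(starts_with("carrier")) %>% head()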
5. Visualizing the hashing
We can explore the hash indexing matrix visually. The plot.matrix package extends the base R plot functionality to do just that.
The data argument for plot needs to be in matrix form, and we will restrict our view to the first 50 rows.
Setting key to NULL eliminates the legend, and border = NA tells R to refrain from plotting a grid.
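A sketch of the plotting call, reusing the hashed tibble from before:

    library(plot.matrix)  # adds a plot() method for matrices

    # Convert the hashed columns to a matrix and show the first 50 rows
    hash_mat <- as.matrix(hashed %>% select(starts_with("carrier")))
    plot(hash_mat[1:50, ],
         key = NULL,   # no legend
         border = NA)  # no grid lines between cells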
We can see that flights from a carrier indexed around 40 are frequent in this section of the data. We also see that many columns are empty, since there are no corresponding flights in this portion of the data.
6. Let's practice!
Let's give it a try.