1. Standardizing features
In this lesson, we will discuss what feature standardization is, why it is important, and how to apply it to features.
2. Why standardization is important
Standardization is the process of making your data fit the assumptions your model makes about its features. If one feature has a much larger variance than the others, it may dominate them within the model, which is undesirable. In the context of CTR prediction, imagine you have several count features, like the ones you created in the previous exercise. If one of them, say device_id_count, has a much larger range of values due to one spam user continually clicking, then the model will weight that count far more heavily than the others, even though the best approach might be to consider all of the count features equally. Therefore, it is important to standardize features to produce the best possible CTR prediction model. Note that standardization does not apply to categorical variables, such as the id columns seen in the previous lesson (site_id, app_id, device_id, etc.), since by definition those variables do not have a numerical scale to compare against.
3. Log normalization
There are several ways to standardize features. One common method is log normalization, which reduces the variance of a feature. It should only be applied to features with very high variance relative to other features. To check the variances of the columns in a DataFrame, you can use the var() method as follows. The output lists the variance of each column. Since the output is a pandas Series, you can chain standard methods like .mean() and .median() to get the average and median variance. To apply log normalization, take a particular column and apply numpy's log() function to all of its elements, as shown here with the device_id_count column. The result is that the column's values span a much smaller range and therefore have a much smaller variance. The larger the feature's original variance, the larger the reduction, due to the mathematical properties of the log function.
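Here is a minimal sketch of both steps, assuming a small DataFrame named df whose count columns are hypothetical stand-ins for the ones built in the previous exercise:

```python
import numpy as np
import pandas as pd

# Hypothetical count features; device_id_count has one extreme value (a spam user)
df = pd.DataFrame({
    "device_id_count": [1, 2, 3, 5000, 2, 4],
    "site_id_count": [10, 12, 9, 11, 13, 10],
})

# Variance of each numeric column
print(df.var())

# Average and median variance across columns
print(df.var().mean())
print(df.var().median())

# Log normalization: apply numpy's log() to every element of the column
df["log_device_id_count"] = np.log(df["device_id_count"])

# The variance is now much smaller
print(df["log_device_id_count"].var())
```

Note that np.log() is undefined at zero, so if a count column can contain zeros, np.log1p() is a common substitute.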
4. Scaling data
Standard scaling is another standardization technique, one that transforms every numerical feature to have a mean of 0 and a standard deviation of 1. This ensures that no feature's scale or range of values impacts the model more than any other's. It also makes features easy to compare with one another: for example, it becomes straightforward to compare different counts such as the device id count, site id count, and search engine type count, which allows the model to effectively figure out which counts are most predictive of CTR. Although standard scaling affects some models more than others, it is generally good practice in machine learning.
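Under the hood, standard scaling subtracts each column's mean and divides by its standard deviation, z = (x - mean) / std. Here is a minimal sketch of that arithmetic done by hand, using a hypothetical count column:

```python
import pandas as pd

df = pd.DataFrame({"device_id_count": [1.0, 2.0, 3.0, 4.0, 100.0]})
col = df["device_id_count"]

# z = (x - mean) / std: the standard scaling formula applied manually
# ddof=0 uses the population standard deviation, matching sklearn's StandardScaler
scaled = (col - col.mean()) / col.std(ddof=0)

print(scaled.mean())       # approximately 0
print(scaled.std(ddof=0))  # exactly 1
```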
5. How to standard scale data
Standard scaling can be done with sklearn's StandardScaler as follows. First, you create an instance of the class, and then you call its fit_transform method on a DataFrame. Since it only makes sense to scale numeric columns, you should select the relevant subset of features within the DataFrame first: it would not make sense to scale datetime columns like the hour column, or categorical columns like device_id, so the scaler is applied only to the relevant universe of numerical columns. The result of the scaling looks like this: every scaled column now has a mean of 0 and a standard deviation of 1.
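A minimal sketch of that workflow, with hypothetical column names standing in for the lesson's data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mix of numeric count columns and a categorical id column
df = pd.DataFrame({
    "device_id_count": [1, 2, 3, 5000],
    "site_id_count": [10, 12, 9, 11],
    "device_id": ["a99f214a", "c357dbff", "a99f214a", "0f7c61dc"],
})

# Only the numeric count columns should be scaled
numeric_cols = ["device_id_count", "site_id_count"]

# Instantiate the scaler, then fit and transform the numeric subset
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Each scaled column now has mean 0 and (population) standard deviation 1
print(df[numeric_cols].mean())
print(df[numeric_cols].std(ddof=0))
```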
6. Let's practice!
Now that you've seen some examples of standardization, let's practice!