
Feature engineering

1. Feature engineering

In this lesson, we will discuss feature engineering - how to deal with different types of variables, and how to construct new features. Along the way, we will take a closer look at some of the existing features. Both will be important in making sure we can use the best features possible for our model in predicting CTR.

2. Dealing with dates

One important data type is the Datetime data type. Often, data meant to represent a datetime is stored in a string or integer format. For example, in the sample data, the hour column is an integer and represents a datetime with an hour component. You can use the to_datetime function from pandas to parse a given column according to a provided format. For example, the hour column, which looks like an integer such as 14102101, turns into the following output: 2014-10-21 01:00:00. Once in a datetime format, various fields such as hour can be extracted from the datetime. Just like other data types, you can use standard groupby and aggregation functions like sum, as shown.
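
As a minimal sketch of this workflow (assuming a DataFrame named df with an integer hour column in yymmddhh format and a click column, as in the sample data):

    import pandas as pd

    # Hypothetical sample: integers like 14102101 encode year, month, day, and hour
    df = pd.DataFrame({'hour': [14102100, 14102101, 14102101],
                       'click': [0, 1, 1]})

    # Parse the integer column as a datetime using the yymmddhh format
    df['datetime'] = pd.to_datetime(df['hour'].astype(str), format='%y%m%d%H')

    # Extract the hour component from the parsed datetime
    df['hour_of_day'] = df['datetime'].dt.hour

    # Standard groupby and aggregation, e.g. total clicks per hour of day
    print(df.groupby('hour_of_day')['click'].sum())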

3. Converting categorical variables via hashing

Categorical features must be converted into a numerical format to be used in sklearn. One method of doing so is a hash function, which converts an arbitrary input into an integer output. Within a single Python session, it returns the exact same output for a given input every time. We can use the hash function as follows. First, we set up a lambda function, an anonymous inline function written as lambda x, followed by a colon, then followed by the expression to evaluate, which we define here as the hash of x. Next, we apply this function to every element of our DataFrame using the apply method. The apply method takes in a lambda function and an axis to operate over, just like we saw in the last lesson, with axis being 0 or 1. For example, here we can apply the hash function to every row of the site_id column. As seen on the left-hand side, the original format for each x is a string, and the right-hand side has the converted numerical result.
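
Here is a short sketch of that conversion, assuming a DataFrame df with a string site_id column; the hashed column name is just for illustration:

    import pandas as pd

    # Hypothetical site_id values in their original string format
    df = pd.DataFrame({'site_id': ['1fbe01fe', 'fe8cc448', '1fbe01fe']})

    # Lambda wrapping the built-in hash function
    hash_fn = lambda x: hash(x)

    # Apply the hash to each row of the site_id column (axis=1 operates row by row);
    # note that Python salts string hashes, so values differ between sessions
    df['site_id_hash'] = df.apply(lambda row: hash_fn(row['site_id']), axis=1)

    print(df)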

4. A closer look at features

Although many columns are represented as integers, these variables are actually categorical ones. To check this, similar to using the value_counts method in chapter 1, you can use the count method to get a count of total values and the nunique method to get a count of total unique values. For example, let's take a look at the ad_type column. There are 50000 rows according to count, and 31 categories according to nunique. Here is a plot of the distribution of values for ad_type. There is no clear pattern in the range and distribution of values. Therefore, this column, as well as many others in the dataset, is categorical.
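
As a quick sketch of this check (assuming the same df with an ad_type column):

    # Total number of non-null values versus number of distinct values
    print(df['ad_type'].count())    # e.g. 50000 rows
    print(df['ad_type'].nunique())  # e.g. 31 unique categories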

5. Creating features

Because the dataset is mostly categorical variables, it is useful to construct numerical features using existing features. More features to explore means the models we run can have a better opportunity to discover what is truly predictive of CTR. For example, say that you wanted to construct a new feature that represents the number of total impressions by each user, which you can assume is linked to just one device_id. Then, you can define a new column and use the transform method to get counts by device_id as follows. This can be done for other breakdowns, such as search engine type, product type, etc.
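
A minimal sketch of that feature, assuming a DataFrame df with a device_id column (the new column name is illustrative):

    import pandas as pd

    # Hypothetical device_id values, one row per impression
    df = pd.DataFrame({'device_id': ['a99f214a', 'a99f214a', '0f7c61dc']})

    # Total impressions per user: count the rows that share each device_id
    df['user_impressions'] = df.groupby('device_id')['device_id'].transform('count')

    print(df)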

6. Let's practice!

Now that you've learned about feature engineering, let's work through some examples!