Exploratory data analysis

1. Exploratory data analysis

In this lesson, we will discuss basic techniques for exploratory data analysis, or EDA, with respect to our ads dataset. We'll take a closer look at our features, talk about missing data, and analyze distributions and breakdowns by CTR.

2. A closer look at features

Exploratory data analysis gives us a better sense of the features we will eventually use to predict CTR, the top-line goal. To start, we can print the columns of our sample data and view the data type of each column using the dtypes attribute of a pandas DataFrame. Categorical variables stored as strings are marked with the "object" data type. For example, the id column is categorical. The common data types are int (integers), float (numerical values with decimals), object (categorical variables), and datetime (dates and times). To view columns based on their data types, we can use the select_dtypes method, whose include argument takes a list of types to keep. For example, filtering this DataFrame for integers or floats would include the click column but not the id column.
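As a rough illustration of the steps just described, here is a minimal sketch on a made-up three-row sample. The id and click columns come from the lesson; the banner_pos column and all of the values are assumptions for demonstration only, not the real ads data.

```python
import pandas as pd

# Hypothetical toy sample; the real ads dataset has more rows and columns.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],   # identifier stored as strings, so not numeric
    "click": [0, 1, 0],         # target: 1 if the ad was clicked, else 0
    "banner_pos": [0, 1, 0],    # assumed integer-coded feature
})

# List the column names and the data type of each column.
print(df.columns)
print(df.dtypes)

# Keep only integer and float columns; the string-valued id column is dropped.
numeric_df = df.select_dtypes(include=["int", "float"])
print(numeric_df.columns.tolist())
```

The same method also accepts an exclude argument, which can be handy when it is easier to name the types to leave out.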

3. Missing data

To check for missing data, you can get a quick overview using the info method, which reports the non-null count for each column. For a more detailed look, you can use the isnull method from pandas, which returns a DataFrame of boolean values marking each missing entry. Chaining the sum method onto that result yields counts of missing values. In pandas, an axis of 0 aggregates down the rows and gives one result per column, while an axis of 1 aggregates across the columns and gives one result per row. So to get the number of missing values in each row, you can use axis = 1, and to get the number in each column, you can use axis = 0. If there are no missing values, then no further steps are needed. Otherwise, missing values can be replaced with the mean or median of the column for numerical columns, and with the mode for categorical columns.
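The checks and replacements described above might look like the following sketch. The DataFrame is a hypothetical four-row sample with two values deliberately left missing; the column names banner_pos and device_type are assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "click": [0, 1, 0, 1],
    "banner_pos": [0.0, np.nan, 1.0, 1.0],   # numeric column
    "device_type": ["1", None, "1", "4"],    # categorical column
})

# Quick overview: non-null counts and dtypes per column.
df.info()

# Boolean DataFrame: True wherever a value is missing.
missing_mask = df.isnull()

# axis=0 sums down the rows: one count per column.
missing_per_column = missing_mask.sum(axis=0)
# axis=1 sums across the columns: one count per row.
missing_per_row = missing_mask.sum(axis=1)
# Summing twice gives the overall number of missing values.
total_missing = int(missing_mask.sum().sum())
print(missing_per_column, missing_per_row, total_missing, sep="\n")

# Fill the numeric column with its median and the categorical column with its mode.
filled = df.copy()
filled["banner_pos"] = filled["banner_pos"].fillna(filled["banner_pos"].median())
filled["device_type"] = filled["device_type"].fillna(filled["device_type"].mode()[0])
```

Whether the mean, median, or mode is the right replacement depends on the column; the median is used here only because it is less sensitive to outliers than the mean.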

4. Looking at distributions

A quick but important step of exploratory data analysis is to look at the distributions of variables. For example, one of the columns in the dataset is called search engine type, and reflects the category of the search engine involved in the intent for an ad. To explore how clicks break down by search engine type, you can group by search engine type and click using the groupby method, followed by the size method, to get a quick look at the counts for each combination. The value 1002 is an arbitrary code that represents a particular search engine, such as Bing. The unstack method pivots the inner index level (click) into columns, giving a table with one row per search engine type and one column per click value. In this example, impressions with a search engine type of Bing had 240 clicks and 940 non-clicks.
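A small sketch of the groupby, size, and unstack steps follows. It assumes the column is named search_engine_type and uses invented rows; the code 1005 is hypothetical, and the counts are not the lesson's real figures.

```python
import pandas as pd

# Hypothetical toy sample; column name and values are assumptions.
df = pd.DataFrame({
    "search_engine_type": [1002, 1002, 1002, 1005, 1005, 1005, 1005],
    "click":              [0,    1,    0,    0,    0,    1,    1],
})

# Count rows for each (search engine type, click) combination.
counts = df.groupby(["search_engine_type", "click"]).size()
print(counts)

# Pivot the click level into columns: one row per search engine type,
# one column per click value (0 = non-click, 1 = click).
wide = counts.unstack()
print(wide)
```

If some search engine type never appears with one of the click values, that cell comes out as missing after unstacking; passing fill_value=0 to unstack is one way to turn it into a zero count.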

5. Breakdown by CTR

Knowing how CTR varies by a feature is useful for prediction. If you see that CTR varies significantly by search engine type, as you did earlier with device type and banner position, then search engine type is a good candidate feature to use. Continuing from the previous slide, you can reset the index of the unstacked DataFrame using the reset_index method, which turns search engine type back into a regular column. It's helpful to rename the column 0 to 'non_clicks' using the rename method, which takes a dictionary mapping old column names to new ones via its columns argument; passing inplace=True ensures the original DataFrame gets modified. Then you can compute CTR by dividing the number of clicks by the total number of impressions, which is the sum of non-clicks and clicks. For the Bing example, that is 240 / (240 + 940), or roughly 20%.
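Putting these steps together, a minimal sketch might look like this. It reuses an invented sample with an assumed search_engine_type column; renaming column 1 to 'clicks' and adding a 'total' column are additions made here for readability, not steps stated in the lesson.

```python
import pandas as pd

# Hypothetical toy sample; column name and values are assumptions.
df = pd.DataFrame({
    "search_engine_type": [1002, 1002, 1002, 1005, 1005, 1005, 1005],
    "click":              [0,    1,    0,    0,    0,    1,    1],
})

# Counts per (search engine type, click), pivoted so click values are columns,
# then reset so search_engine_type becomes a regular column again.
ctr_df = df.groupby(["search_engine_type", "click"]).size().unstack().reset_index()

# Rename the click-value columns; inplace=True modifies ctr_df directly.
ctr_df.rename(columns={0: "non_clicks", 1: "clicks"}, inplace=True)

# CTR = clicks / total impressions, where total = non-clicks + clicks.
ctr_df["total"] = ctr_df["non_clicks"] + ctr_df["clicks"]
ctr_df["CTR"] = ctr_df["clicks"] / ctr_df["total"]
print(ctr_df)
```

Sorting the result by the CTR column is a quick way to see which search engine types stand out.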

6. Let's practice!

Now that you have had a high-level overview of EDA, let's jump right in!
