1. Initial EDA
Now we know how to figure out what problem we're addressing, and how to use the appropriate metric. The next step is to look at the data and find interesting patterns in it using Exploratory Data Analysis (EDA for short).
2. Goals of EDA
EDA has multiple goals.
To start with, we could get the size of the train and test data. It will give us an idea of how many resources we need for the competition and what models we could use.
Then we could investigate the properties of the target variable. For example, there could be a high class imbalance in the classification problem, or a skewed distribution in the regression problem.
Similarly, we could look at the properties of the features. Finding some peculiarities and dependencies between features and target variable is always useful.
Also, EDA is a good place to start in order to generate some ideas and future hypotheses on feature engineering.
3. Two Sigma Connect: Rental Listing Inquiries
In this video we'll work with another Kaggle competition. It's called "Two Sigma Connect: Rental Listing Inquiries".
In this Kaggle competition, we need to predict how popular an apartment rental listing is based on the listing content.
The target variable, 'interest_level', is defined by the number of inquiries a listing obtains.
Interest level is split into 3 groups: high, medium and low. So, we have a classification problem with 3 classes.
And the metric is a multi-class logarithmic loss.
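To make the metric concrete, here is a minimal sketch of multi-class logarithmic loss computed with NumPy. The function name, the integer encoding of the classes and the toy probabilities are all assumptions made for illustration; they are not taken from the competition.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    y_true = np.asarray(y_true)
    # Clip to avoid log(0) for over-confident predictions
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # Renormalise rows so each still sums to 1 after clipping
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    # Pick the predicted probability of the true class for each row
    true_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_prob))

# Hypothetical encoding: 0 = low, 1 = medium, 2 = high
y_true = [0, 2, 1]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6],
          [0.2, 0.5, 0.3]]

loss = multiclass_log_loss(y_true, y_prob)
print(loss)
```

The lower the loss, the more probability the model puts on the correct class, so confident wrong predictions are penalised heavily.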
4. EDA. Part I
Generally, the first part of the EDA is to look at some basic statistics regarding our data.
So, let's start with reading train and test data and finding their shape.
We see that the train dataset has about 50 thousand observations and 11 columns. The test dataset has about 75 thousand observations and all the columns except for the target variable.
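Reading the data and checking its shape could look roughly like the sketch below. To keep it runnable without the competition files, it reads tiny in-memory CSVs with made-up rows; the column names other than 'interest_level' and the file names mentioned in the comments are assumptions.

```python
import io
import pandas as pd

# Toy stand-ins for the competition files (made-up rows, hypothetical columns).
# With the real data you would pass file paths, e.g. pd.read_csv("train.csv").
train_csv = io.StringIO(
    "id,bedrooms,price,interest_level\n"
    "1,2,3000,low\n"
    "2,1,2400,medium\n"
    "3,3,5200,low\n"
)
test_csv = io.StringIO(
    "id,bedrooms,price\n"
    "4,2,3100\n"
    "5,1,2000\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# shape is a (rows, columns) tuple
print("Train shape:", train.shape)
print("Test shape:", test.shape)

# Columns present in train but missing from test (normally just the target)
missing_in_test = set(train.columns) - set(test.columns)
print(missing_in_test)
```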
5. EDA. Part I
Then let's look at the columns. We have the id of the observation, the number of bathrooms and bedrooms in the apartments, the exact coordinates of the apartments, the manager responsible for this listing, the rental price, and finally the target variable: 'interest_level'.
Then, for example, we could obtain the distribution of 'interest_level' using pandas' value_counts() method. The majority of the listings have low interest, while only about four thousand observations have high interest. It means that we have some class imbalance. However, it is not severe enough to make class balancing methods crucial.
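As a small illustration of value_counts(), the sketch below counts classes in a toy 'interest_level' column. The six rows are made up and do not reflect the real class proportions.

```python
import pandas as pd

# Toy target column with made-up values
train = pd.DataFrame(
    {"interest_level": ["low", "low", "low", "medium", "medium", "high"]}
)

# Absolute number of listings per class, most frequent first
counts = train["interest_level"].value_counts()
print(counts)

# Relative frequencies make the degree of imbalance easier to judge
shares = train["interest_level"].value_counts(normalize=True)
print(shares)
```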
6. EDA. Part I
Another useful approach to take a first glimpse at the data is pandas' describe() method. It shows the basic statistics of all the numeric columns in the DataFrame. Let's apply it to our train data.
We see the minimum and maximum values, together with quartiles and mean values, as well as the count and standard deviation.
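A quick sketch of describe() on a toy DataFrame is shown here. The values and the 'bedrooms' and 'price' column names are assumptions for illustration; note that the non-numeric 'interest_level' column is left out of the summary by default.

```python
import pandas as pd

# Toy data with made-up values
train = pd.DataFrame({
    "bedrooms": [1, 2, 3, 2],
    "price": [2400, 3000, 5200, 3100],
    "interest_level": ["medium", "low", "low", "high"],
})

# Summary statistics for the numeric columns only
summary = train.describe()
print(summary)
```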
7. EDA. Part II
The next part of the EDA is to actually draw some plots and find interesting dependencies. We will use the matplotlib library. Let's import pyplot from matplotlib as plt and use the ggplot style.
As an example, let's compare the median price of the apartments across different interest levels.
For this purpose, we use pandas' groupby() method and get the median. Note that we set the 'as_index' parameter to False in order not to turn 'interest_level' into the index.
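The grouping step might be sketched as follows, again on made-up rows with an assumed 'price' column. With as_index=False, 'interest_level' stays a regular column, which is convenient for plotting.

```python
import pandas as pd

# Toy data with made-up prices
train = pd.DataFrame({
    "interest_level": ["low", "low", "low", "medium", "medium", "high"],
    "price": [3000, 5200, 4000, 2400, 2800, 2000],
})

# Median price per interest level; keep 'interest_level' as a column
prices = train.groupby("interest_level", as_index=False)["price"].median()
print(prices)
```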
8. EDA. Part II
We then create a figure and use the bar() method to plot 'interest_level' versus median price.
We also set titles for the axes and for the plot itself.
Finally, we call the show() method to display the plot.
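Put together, the plotting steps could look like the sketch below. The median prices are made-up placeholders, and the figure size, axis labels and title are assumptions. The Agg backend is selected only so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working in a notebook
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use("ggplot")

# Placeholder median prices per interest level (made-up values)
prices = pd.DataFrame({
    "interest_level": ["high", "low", "medium"],
    "price": [2000.0, 4000.0, 2600.0],
})

# Create the figure and draw one bar per interest level
fig = plt.figure(figsize=(7, 5))
plt.bar(prices["interest_level"], prices["price"])

# Label the axes and the plot itself
plt.xlabel("Interest level")
plt.ylabel("Median price")
plt.title("Median listing price across interest level")

plt.show()
```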
9. EDA. Part II
As we see, medium and high interest listings have lower median prices. This suggests that people are searching for cheaper apartments.
A potentially useful new feature would be the price per bedroom. If an apartment has a price per bedroom lower than market average, then the listing could obtain higher interest.
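One possible way to build such a feature is sketched below on made-up rows. The column name 'price_per_bedroom' is hypothetical, and treating zero-bedroom listings (studios) as missing is an assumption made here to avoid division by zero, not something prescribed by the competition.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; the third listing is a studio (0 bedrooms)
train = pd.DataFrame({
    "bedrooms": [2, 1, 0, 3],
    "price": [3000, 2400, 1800, 5400],
    "interest_level": ["medium", "low", "high", "low"],
})

# Assumption: treat zero bedrooms as missing so the ratio is NaN, not infinite
bedrooms = train["bedrooms"].replace(0, np.nan)
train["price_per_bedroom"] = train["price"] / bedrooms

print(train[["price", "bedrooms", "price_per_bedroom"]])
```

The new column can then be compared across interest levels in the same way as the raw price.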
10. Let's practice!
You just saw a few examples of the initial EDA. Let's practice more on the Taxi Fare Prediction data!