1. Generating new features
Sometimes the format of our data can limit our ability to detect relationships or inhibit the potential performance of machine learning models. One method to overcome these issues is to generate new features from our data!
2. Correlation
Checking correlation with a heatmap, we see a moderate positive correlation between Price and Duration, but it looks like those are the only numeric variables in our dataset.
3. Viewing data types
Viewing the data types confirms this is the case. However, Total_Stops should also be numeric.
4. Total stops
Viewing the value_counts, we see we need to remove string characters, and change non-stop to zero, before converting the data type to integer.
5. Cleaning total stops
We use the string-dot-replace method to first remove " stops", including the space, so that flights with two, three, or four stops are ready to convert.
Next we clean flights with one stop.
Lastly, we change "non-stop" to "0", then set the data type to integer.
6. Correlation
Unsurprisingly, Total_Stops is strongly correlated with Duration.
What is interesting is that Total_Stops and Price are more strongly correlated than Duration is with Price!
Let's see what else we can find out!
7. Dates
Rechecking our data types, notice that there are three datetime variables - Date_of_Journey, Dep_Time, and Arrival_Time.
8. Extracting month and weekday
We know how to extract attributes from datetime values, so we can see if these offer any insights into pricing.
To start, let's look at Date_of_Journey. If we think prices vary per month, it's worth using this attribute - we create it as a column in our DataFrame. Perhaps prices might also differ depending on the day of the week? Let's grab that using the dt-dot-weekday attribute. It returns values of zero, representing Monday, through to seven, for Sunday.
Previewing these columns we see the first flight, departing on the 6th September, was a Friday, indicated by a four.
9. Departure and arrival times
We might wonder if people tend to pay more to depart or arrive at more convenient times.
We extract the hour of departure and arrival from those respective columns too.
10. Correlation
Because they are numeric, we can calculate correlation between these new datetime features and other variables.
Re-plotting our heatmap, unfortunately there aren't any new strong relationships. But we wouldn't have known this if we hadn't generated these features.
11. Creating categories
There's one more technique we can use to generate new features. We can group numeric data and label them as classes.
For example, we don't have a column for ticket type. We could use descriptive statistics to label flights as economy, premium economy, business class, or first class, based on prices within specific ranges, or bins.
12. Descriptive statistics
We'll split equally across the price range using quartiles.
We first store the 25th percentile using the quantile method.
We get the 50th percentile by calling the median.
Next we get the 75th percentile,
and lastly, we store the maximum value.
13. Labels and bins
We create the labels, in this case our ticket types, and store as a list.
Next, we create the bins, a list starting from zero and including our descriptive statistic variables.
14. pd.cut()
We now call the pd-dot-cut function,
15. pd.cut()
passing our Price column,
16. pd.cut()
setting the labels argument equal to our labels variable,
17. pd.cut()
and the bins argument equal to our bins.
18. Price categories
Previewing the Price and Price_Category columns, we see the mapping has been successfully applied!
19. Price category by airline
We can plot the count of flights in different categories per airline by passing our new column to the hue argument when calling sns-dot-countplot.
20. Price category by airline
Looks like Jet Airways has the largest number of "First Class" tickets, while most of IndiGo and SpiceJet's flights are "Economy".
21. Let's practice!
Let's generate some new features in our data professionals dataset!