Generating new features

1. Generating new features

Sometimes the format of our data can limit our ability to detect relationships or inhibit the potential performance of machine learning models. One method to overcome these issues is to generate new features from our data!

2. Correlation

Checking correlation with a heatmap, we see a moderate positive correlation between Price and Duration, but it looks like those are the only numeric variables in our dataset.

3. Viewing data types

Viewing the data types confirms this is the case. However, Total_Stops should also be numeric.
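As a minimal sketch of this check, assuming a small made-up DataFrame named `planes` (the name and all values here are illustrative, not the real dataset), we can list the data types and compute correlation over the numeric columns only:

```python
import pandas as pd

# Hypothetical miniature stand-in for the flights data; all values are made up.
planes = pd.DataFrame({
    "Airline": ["Jet Airways", "IndiGo", "SpiceJet", "Jet Airways"],
    "Total_Stops": ["1 stop", "non-stop", "non-stop", "2 stops"],
    "Duration": [13.5, 2.5, 2.0, 20.0],
    "Price": [13882.0, 3873.0, 3500.0, 17000.0],
})

# Total_Stops is stored as text, so it is not treated as numeric.
print(planes.dtypes)

# Keep only the numeric columns and compute their pairwise correlation.
numeric_cols = planes.select_dtypes(include="number").columns.tolist()
corr = planes[numeric_cols].corr()
print(corr)

# With seaborn and matplotlib installed, this matrix could be drawn with:
# sns.heatmap(corr, annot=True)
```

Total_Stops is missing from the correlation matrix because it still holds strings.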

4. Total stops

Viewing the value_counts, we see we need to remove string characters, and change non-stop to zero, before converting the data type to integer.

5. Cleaning total stops

We use the string-dot-replace method to first remove " stops", including the space, so that flights with two, three, or four stops are ready to convert. Next we clean flights with one stop. Lastly, we change "non-stop" to "0", then set the data type to integer.
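Those three replacements might look like the sketch below. The sample values are made up, and `regex=False` is our own choice so that each pattern is matched as literal text.

```python
import pandas as pd

# Made-up sample of the raw Total_Stops values.
stops = pd.Series(
    ["non-stop", "1 stop", "2 stops", "3 stops", "4 stops"],
    name="Total_Stops",
)

cleaned = (
    stops
    .str.replace(" stops", "", regex=False)     # "2 stops" becomes "2"
    .str.replace(" stop", "", regex=False)      # "1 stop" becomes "1"
    .str.replace("non-stop", "0", regex=False)  # "non-stop" becomes "0"
    .astype(int)                                # strings to integers
)
print(cleaned.tolist())
```

The order matters here: removing " stop" first would leave a stray "s" behind on the plural values.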

6. Correlation

Unsurprisingly, Total_Stops is strongly correlated with Duration. What is interesting is that Total_Stops and Price are more strongly correlated than Duration is with Price! Let's see what else we can find out!

7. Dates

Rechecking our data types, notice that there are three datetime variables - Date_of_Journey, Dep_Time, and Arrival_Time.

8. Extracting month and weekday

We know how to extract attributes from datetime values, so we can see if these offer any insights into pricing. To start, let's look at Date_of_Journey. If we think prices vary per month, it's worth extracting the month, so we create it as a column in our DataFrame. Perhaps prices might also differ depending on the day of the week? Let's grab that using the dt-dot-weekday attribute. It returns values from zero, representing Monday, through to six, for Sunday. Previewing these columns, we see the first flight, departing on the 6th of September, was a Friday, indicated by a four.
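Here is a small sketch of that extraction. The new column names `month` and `weekday` are assumptions, and only the first date comes from the narration; the other two are made up.

```python
import pandas as pd

# The first date is the one mentioned above; the others are illustrative.
planes = pd.DataFrame({
    "Date_of_Journey": pd.to_datetime(["2019-09-06", "2019-12-05", "2019-03-24"]),
})

# Month as an integer from 1 to 12.
planes["month"] = planes["Date_of_Journey"].dt.month

# Day of the week as an integer: Monday is 0 and Sunday is 6.
planes["weekday"] = planes["Date_of_Journey"].dt.weekday

print(planes)
```

The last row falls on a Sunday, so its weekday is six, the largest value the attribute returns.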

9. Departure and arrival times

We might wonder if people tend to pay more to depart or arrive at more convenient times. We extract the hour of departure and arrival from those respective columns too.
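The same pattern works for the time columns. In this sketch the timestamps are made up and the new column names `Dep_Hour` and `Arrival_Hour` are assumptions.

```python
import pandas as pd

# Made-up departure and arrival timestamps.
planes = pd.DataFrame({
    "Dep_Time": pd.to_datetime(["2019-09-06 09:25", "2019-12-05 18:05"]),
    "Arrival_Time": pd.to_datetime(["2019-09-06 23:15", "2019-12-05 21:30"]),
})

# Hour of the day as an integer from 0 to 23.
planes["Dep_Hour"] = planes["Dep_Time"].dt.hour
planes["Arrival_Hour"] = planes["Arrival_Time"].dt.hour

print(planes[["Dep_Hour", "Arrival_Hour"]])
```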

10. Correlation

Because they are numeric, we can calculate correlation between these new datetime features and other variables. Re-plotting our heatmap, we unfortunately don't see any new strong relationships. But we wouldn't have known this if we hadn't generated these features.

11. Creating categories

There's one more technique we can use to generate new features. We can group numeric data and label them as classes. For example, we don't have a column for ticket type. We could use descriptive statistics to label flights as economy, premium economy, business class, or first class, based on prices within specific ranges, or bins.

12. Descriptive statistics

We'll split the flights into four equally sized groups by price, using quartiles. We first store the 25th percentile using the quantile method. We get the 50th percentile by calling the median method. Next we get the 75th percentile, and lastly, we store the maximum value.

13. Labels and bins

We create the labels, in this case our ticket types, and store them as a list. Next, we create the bins, a list starting from zero and including our descriptive statistic variables.
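As a sketch of these two steps, assuming a handful of made-up prices and variable names of our own choosing:

```python
import pandas as pd

# Made-up prices; the real column holds many more values.
prices = pd.Series([2000, 4000, 6000, 8000, 10000], name="Price")

# Descriptive statistics that will serve as bin edges.
twenty_fifth = prices.quantile(0.25)
median = prices.median()
seventy_fifth = prices.quantile(0.75)
maximum = prices.max()

# One label per bin, in ascending price order.
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]

# Bin edges: zero, then each statistic in ascending order.
bins = [0, twenty_fifth, median, seventy_fifth, maximum]
print(bins)
```

There is always one more bin edge than there are labels, because each label covers the range between two neighbouring edges.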

14. pd.cut()

We now call the pd-dot-cut function,

15. pd.cut()

passing our Price column,

16. pd.cut()

setting the labels argument equal to our labels variable,

17. pd.cut()

and the bins argument equal to our bins.
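Put together, the call could look like this sketch. The prices are made up, and `Price_Category` is the assumed name of the new column.

```python
import pandas as pd

# Made-up prices standing in for the real Price column.
planes = pd.DataFrame({"Price": [2000, 4000, 6000, 8000, 10000]})

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [
    0,
    planes["Price"].quantile(0.25),
    planes["Price"].median(),
    planes["Price"].quantile(0.75),
    planes["Price"].max(),
]

# By default each bin includes its right edge, so a price equal to
# the 25th percentile still lands in the first category.
planes["Price_Category"] = pd.cut(planes["Price"], labels=labels, bins=bins)

print(planes)
```

Because the lowest edge is excluded by default, a price of exactly zero would be left uncategorized; that is not a concern for ticket prices.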

18. Price categories

Previewing the Price and Price_Category columns, we see the mapping has been successfully applied!

19. Price category by airline

We can plot the count of flights in different categories per airline by passing our new column to the hue argument when calling sns-dot-countplot.
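If we want the numbers behind such a plot, one option (our own addition, not part of the lesson) is to tabulate them with pd-dot-crosstab. The rows below are made up; the countplot call is shown only as a comment because it needs seaborn and matplotlib.

```python
import pandas as pd

# Made-up flights with an already assigned price category.
planes = pd.DataFrame({
    "Airline": ["Jet Airways", "Jet Airways", "IndiGo", "IndiGo", "SpiceJet"],
    "Price_Category": ["First Class", "Business Class", "Economy", "Economy", "Economy"],
})

# Count flights per airline and price category.
counts = pd.crosstab(planes["Airline"], planes["Price_Category"])
print(counts)

# With seaborn and matplotlib installed, the equivalent plot would be:
# sns.countplot(data=planes, x="Airline", hue="Price_Category")
# plt.show()
```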

20. Price category by airline

Looks like Jet Airways has the largest number of "First Class" tickets, while most of IndiGo and SpiceJet's flights are "Economy".

21. Let's practice!

Let's generate some new features in our data professionals dataset!