Generating hypotheses

1. Generating hypotheses

Generating hypotheses is a fundamental task for data scientists. Let's look at how and when this is done!

2. What do we know?

It's reasonable to feel like we have a good idea about our planes dataset at this point, right? We've explored our data extensively and even generated new features to get new insights! We know that a large proportion of Jet Airways' tickets are expensive, as we labeled them as First Class!

3. What do we know?

We also know that Duration, Total_Stops, and Price are all moderately correlated, but no other meaningful relationships exist.

4. Spurious correlation

But if we generate a scatter plot of Price versus Duration, factoring in Total_Stops, it looks like Total_Stops largely depends on Duration. This is an example of a spurious correlation - we might think that Total_Stops is correlated with Price, but in fact it's just Duration that is correlated with Price, and Total_Stops mostly maps to Duration ranges!

5. How do we know?

Also, if we split out the number of stops to look at correlation individually, it looks like zero stops has a strong negative correlation with price, but there's no meaningful relationship for journeys with three or four stops!

6. What is true?

When performing EDA, the question we should ask is how do we know what we are observing is true? For example, if we collected new data on flights from a different time period, would we observe the same results? To make conclusions regarding relationships, differences, and patterns in our data, we need to use a branch of statistics called Hypothesis Testing. This involves the following steps before we even start collecting data: coming up with a hypothesis, or question, and specifying a statistical test that we will perform in order to reasonably conclude whether the hypothesis was true or false.

7. Data snooping

Let's imagine we work for an agency regulating airlines, so we have our planes data available as part of our day-to-day work without any specific questions in mind. We might be thinking, well, we have all this data, so why not just come up with questions and run some tests now? But we didn't collect the data with the aim of answering these questions. Plus, we've already looked at the data extensively and generated new features, so we might be biased and generate only hypotheses that we are already confident will hold, just to prove ourselves right! We could also be tempted to run lots of tests, since we have lots of data. The acts of excessive exploratory analysis, the generation of multiple hypotheses, and the execution of multiple statistical tests are collectively known as data snooping, or p-hacking. Chances are, if we look at enough data and run enough tests, we will find a significant result.
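To see why running lots of tests is risky, here is a small simulation. It is a sketch under stated assumptions: both groups in every test are pure random noise with the same mean, the test is a simple z-test that assumes unit variance, and the 0.05 threshold, group size, and number of tests are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(42)

N_TESTS = 200       # how many unrelated "hypotheses" we test
N_PER_GROUP = 50    # observations per group in each test
ALPHA = 0.05        # significance threshold

def two_sided_p_value(group_a, group_b):
    """z-test for a difference in means, assuming unit variance in both groups."""
    standard_error = (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

false_positives = 0
for _ in range(N_TESTS):
    # Both groups come from the same distribution, so no real difference exists.
    a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    if two_sided_p_value(a, b) < ALPHA:
        false_positives += 1

print(f"{false_positives} of {N_TESTS} tests were 'significant' on pure noise")
```

At a 0.05 threshold we would expect roughly 5% of these tests, around 10 of the 200, to come out "significant" even though there is nothing to find.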

8. Generating hypotheses

So how do we generate hypotheses? We perform some EDA! Say we think that, on average, Jet Airways flights last longer than SpiceJet flights. We can create a bar plot, which shows us the mean duration per Airline.

9. Generating hypotheses

Or we might have a hunch that flights to New Delhi are more expensive than other destinations on average. Again, we can plot the data to see if this seems to be the case.
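The numbers behind both of these bar plots are just group means. A minimal sketch of that calculation follows; the four rows, the "Other" destination label, and the helper name mean_by are invented for illustration and do not come from the planes dataset.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows, NOT the planes dataset. Duration is in hours.
flights = [
    {"Airline": "Jet Airways", "Destination": "New Delhi", "Duration": 10.0, "Price": 12000},
    {"Airline": "Jet Airways", "Destination": "Other", "Duration": 14.0, "Price": 9000},
    {"Airline": "SpiceJet", "Destination": "New Delhi", "Duration": 2.0, "Price": 11000},
    {"Airline": "SpiceJet", "Destination": "Other", "Duration": 3.0, "Price": 4000},
]

def mean_by(rows, group_col, value_col):
    """Return {group: mean of value_col} for each distinct value of group_col."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_col]].append(row[value_col])
    return {group: mean(values) for group, values in grouped.items()}

mean_duration = mean_by(flights, "Airline", "Duration")
mean_price = mean_by(flights, "Destination", "Price")

print(mean_duration)
print(mean_price)
```

With pandas and seaborn, and assuming a DataFrame named planes, the equivalent plot would be something like sns.barplot(data=planes, x="Airline", y="Duration"), since barplot shows the mean of each category by default.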

10. Next steps

From there, we need to design our experiment. This involves many steps such as choosing a sample, calculating how many data points we need, and deciding what statistical test to run. The steps involved in this process are outside the scope of this course, but hopefully we now have a sense of the advantages, limitations, and overall remit of exploratory data analysis in a data science workflow!

11. Let's practice!

Time to practice generating hypotheses!