
Hypothesis testing

1. Hypothesis testing

Welcome to the final chapter of the course, where we'll talk about correlation and hypothesis testing. Let's start with hypothesis testing.

2. Why do we need to know about hypothesis testing?

Hypothesis testing is a group of theories, methods, and techniques for comparing populations. So why do we need to know about hypothesis testing? Firstly, it is routinely used in many industries. For example, a company may have a theory that increasing the price of its product will increase revenue, or that changing the name of a website will increase traffic. We can even use hypothesis testing to analyze whether a medication is effective in the treatment of specific health conditions!

3. The history of hypothesis testing

Not only is hypothesis testing all around us, but it is also a well-established discipline! Early origins can be traced to the 18th century when the analysis of birth records showed that each birth has a slightly larger probability of being male than female!

4. Assume nothing!

In hypothesis testing, we always start with the assumption that no difference exists between the populations. This is called the null hypothesis, and starting from it reduces the risk of introducing bias into our testing. We can expand on the example of the male-to-female birth ratio to look at vitamin C supplements. Our null hypothesis could be that there is no difference in the male-to-female birth ratio between women who do and do not take vitamin C supplements. We then create an alternative hypothesis, which can typically take one of two forms. We can say that there is a difference in the male-to-female birth ratio between women taking vitamin C supplements and those who do not. Or we can state the direction of the difference, for example, that the population taking vitamin C supplements has a higher proportion of female births than the population not taking them.
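Written as symbols, the two kinds of hypotheses might look like the sketch below. The notation is an assumption made for illustration: p_C stands for the proportion of female births among women taking vitamin C supplements, and p_N for the proportion among women who do not.

```latex
% Null hypothesis: no difference between the two populations
H_0 : p_C = p_N

% Alternative hypothesis, form 1: there is a difference (no direction stated)
H_A : p_C \neq p_N

% Alternative hypothesis, form 2: the direction of the difference is stated
H_A : p_C > p_N
```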

5. Hypothesis testing workflow

There are many ways to perform hypothesis testing, but a general workflow is as follows. First, we decide on the populations we want to analyze the difference between, in this case adult women using or not using vitamin C supplements. Then, we develop null and alternative hypotheses: the null hypothesis is that the proportion of female births is the same in both populations, and the alternative hypothesis is that babies are more likely to be female among women taking vitamin C supplements. Now we collect our sample data. Specifically, we record the gender of babies born in both populations. We then perform statistical tests on the sample data. Finally, we use the results to draw conclusions about the populations that the samples represent.
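The workflow above can be sketched in Python. This is a minimal illustration, not the course's own method: it assumes a one-sided two-proportion z-test as the statistical test, the birth counts are invented for the example, and the function name `two_proportion_z_test` and the 0.05 threshold are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(count_a, n_a, count_b, n_b):
    """One-sided z-test: is the proportion in group A larger than in group B?"""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Under the null hypothesis both groups share one proportion, so pool them.
    p_pool = (count_a + count_b) / (n_a + n_b)
    std_error = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Probability of a z-score at least this large if the null were true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Step 3: sample data. These counts are made up for illustration only.
female_supp, births_supp = 530, 1000  # women taking vitamin C supplements
female_none, births_none = 490, 1000  # women not taking supplements

# Step 4: perform the statistical test on the sample data.
z, p_value = two_proportion_z_test(female_supp, births_supp,
                                   female_none, births_none)

# Step 5: draw a conclusion. The 0.05 threshold is an assumed convention.
alpha = 0.05
reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

A small p-value suggests the observed difference would be unlikely if the null hypothesis were true; it does not by itself show that vitamin C causes the difference.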

6. How much data do we need?

So how many births do we need to record the gender of? As the sample size grows, the sample proportions of male and female births get closer to the population proportions (the law of large numbers), and by the central limit theorem the sampling distribution of those proportions becomes approximately normal. However, collecting large samples can take a lot of time and resources! A common approach is to look at peer-reviewed research on similar hypothesis tests to find out how large the samples were. This can then serve as a benchmark.
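The effect of sample size can be seen in a small simulation. This is a sketch under stated assumptions: the population probability of a male birth (`P_MALE = 0.51`) is a made-up value, and the helper `sample_male_proportion` is a hypothetical name introduced for this example.

```python
import random


def sample_male_proportion(n_births, p_male, rng):
    """Simulate n_births and return the share of them that are male."""
    males = sum(rng.random() < p_male for _ in range(n_births))
    return males / n_births


rng = random.Random(42)  # fixed seed so the run is repeatable
P_MALE = 0.51            # hypothetical population probability of a male birth

sample_sizes = [10, 100, 1_000, 10_000, 100_000]
proportions = {n: sample_male_proportion(n, P_MALE, rng) for n in sample_sizes}

for n, proportion in proportions.items():
    print(f"n = {n:>7}: sample proportion male = {proportion:.4f}")
```

Small samples can land far from the population value, while the largest samples tend to sit close to it, which is why sample size matters when planning a test.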

7. Independent and dependent variables

A note on terminology. In hypothesis testing, we define the data in terms of the difference we expect to observe in the alternative hypothesis. The independent variable describes data we expect will not be affected by other data. For our vitamin C and birth ratio hypothesis test, this would be vitamin C supplementation, meaning it is independent of the male-to-female birth ratio. The dependent variable is the data we expect to be affected by other values. In the alternative hypothesis, we propose that the male-to-female birth ratio will be affected by vitamin C supplementation, thus it is dependent on vitamin C. These terms are commonly used when describing the results of hypothesis tests, as well as when visualizing results such as on a scatter plot, where by convention the independent variable is on the x-axis and the dependent variable is on the y-axis.

8. Let's practice!

Time to check our understanding of hypothesis testing!