ANOVA

1. ANOVA

Hello! In this video we'll review ANOVA.

2. Introduction to ANOVA

ANOVA stands for Analysis of Variance. For me, the name sounded a little bit odd in the beginning because of the word "variance", while we're actually testing means with ANOVA. The name is appropriate because inferences about means are made by analyzing variance. Let's discuss the hypotheses of ANOVA.

3. Hypotheses

The null hypothesis states that the mean is equal across all groups. The alternative hypothesis is that at least one pair of group means differs.

4. Hypotheses

For example, we want to see if the time spent studying is different between students in primary school, secondary school, and university. The null hypothesis states that the mean is equal across groups, or that the average time spent studying is the same regardless of what school you're in.
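Written out for the three school groups, the hypotheses might look like the following. The subscripted symbols are labels introduced here for illustration, not notation from the video.

```latex
\begin{align*}
H_0 &: \mu_{\text{primary}} = \mu_{\text{secondary}} = \mu_{\text{university}} \\
H_1 &: \mu_i \neq \mu_j \text{ for at least one pair of groups } i \neq j
\end{align*}
```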

5. Hypotheses

You see, this is something that we've had in the previous video on the two-sample t-test, but we were comparing only two means.

6. Hypotheses

Now we go big and compare multiple means at the same time. During the interview, you might be asked why not use the two-sample t-test for each of the pairs instead of performing ANOVA.

7. Type I and II errors

Before we answer this question, let's review type I and type II errors.

8. Type I and II errors

If the null hypothesis is true and we fail to reject it, or if the null hypothesis is false and we reject it, then our decision is correct.

9. Type I and II errors

If the null hypothesis is true, and we reject it, we make a type I error.

10. Type I and II errors

If the null hypothesis is false, and we fail to reject it, we make a type II error. Now, let's go back to the question, why shouldn't we perform multiple t-tests?

11. Why not multiple t-tests?

Let's imagine that each dot represents a mean that will be compared to other means.

12. Why not multiple t-tests?

Every time you conduct a t-test, there is a chance that you will make a type I error. That chance is the significance level of the test, which is usually set to 5%.

13. Why not multiple t-tests?

By running two t-tests on the same data, you will increase your chance of making a type I error.

14. Why not multiple t-tests?

In this example, you are exposed to the risk of error three times.
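To put a number on that risk, here is a minimal sketch of the chance of making at least one type I error across all pairwise tests. It assumes the tests are independent, which is a simplification because pairwise t-tests on the same groups share data, so treat the result as a rough approximation. The variable names are made up for this example.

```r
# Significance level of each individual t-test.
alpha <- 0.05

# Three groups (primary, secondary, university) give three pairwise tests.
n_groups <- 3
n_tests <- choose(n_groups, 2)

# Probability of at least one type I error across all tests,
# ASSUMING the tests are independent (a simplification).
fwer <- 1 - (1 - alpha)^n_tests

print(n_tests)  # 3
print(fwer)     # 0.142625
```

So with three tests at 5% each, the chance of at least one false positive is roughly 14% rather than 5%.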

15. Why not multiple t-tests?

ANOVA compares all the group means in a single test, so the overall type I error rate remains at 5%. That's why it is the preferred method. If you want to refresh your knowledge of the mathematics behind ANOVA, look for statistical inference courses on DataCamp.

16. Assumptions

ANOVA assumes independence of cases, normally distributed data within each group, and homogeneity of variances across groups.
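One possible way to check two of these assumptions in R is sketched below, using the Shapiro-Wilk test for normality and Bartlett's test for equal variances. The video does not prescribe these particular tests, and the data frame `study` with its columns `school` and `hours` is simulated purely for illustration. Independence of cases comes from how the data were collected, so it is not checked here.

```r
# Simulated, hypothetical data: study hours for three school levels.
set.seed(1)
study <- data.frame(
  school = factor(rep(c("primary", "secondary", "university"), each = 30)),
  hours  = c(rnorm(30, mean = 5, sd = 2),
             rnorm(30, mean = 7, sd = 2),
             rnorm(30, mean = 10, sd = 2))
)

# Fit the one-way model so that we can inspect its residuals.
fit <- aov(hours ~ school, data = study)

# Normality: Shapiro-Wilk test on the residuals (null: residuals are normal).
normality <- shapiro.test(residuals(fit))

# Homogeneity of variances: Bartlett's test (null: group variances are equal).
variances <- bartlett.test(hours ~ school, data = study)

print(normality)
print(variances)
```

In both tests a small p-value would suggest that the assumption is violated.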

17. ANOVA in R

To run one-way ANOVA in R, you can use the oneway.test function from the stats package, which ships with base R.
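Here is a minimal sketch of such a call on the study-time example. The data frame `study` and its columns are simulated and hypothetical. Note that oneway.test defaults to `var.equal = FALSE`, which applies a Welch correction, so the sketch sets `var.equal = TRUE` to get the classical one-way ANOVA that matches the equal-variance assumption.

```r
# Simulated, hypothetical data: study hours for three school levels.
set.seed(42)
study <- data.frame(
  school = factor(rep(c("primary", "secondary", "university"), each = 30)),
  hours  = c(rnorm(30, mean = 5, sd = 2),
             rnorm(30, mean = 7, sd = 2),
             rnorm(30, mean = 10, sd = 2))
)

# Classical one-way ANOVA: response ~ grouping factor.
# var.equal = TRUE assumes equal group variances; the default (FALSE)
# would run the Welch version instead.
result <- oneway.test(hours ~ school, data = study, var.equal = TRUE)

print(result)

# The F statistic, degrees of freedom, and p-value are stored in the result.
result$statistic
result$parameter
result$p.value
```

A p-value below the chosen significance level leads us to reject the null hypothesis that all group means are equal.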

18. Box plot

The interviewer might ask you to do some EDA on the data first. Box plots are useful to compare distributions of several groups. A box plot presents five common statistics.

19. Box plot

The ends of the box are the 1st and 3rd quartiles.

20. Box plot

The line within the box is the median.

21. Box plot

The length of the box is the interquartile range, that is, the distance between the 1st and 3rd quartiles.

22. Box plot

The ends of the whiskers are minimum and maximum.

23. Box plot

And the points are outliers. When I first learned this, the minimum and maximum statistics were confusing for me.

24. Box plot

In R by default, the minimum is the lowest datum still within 1.5 times the interquartile range of the lower quartile, and the maximum is the highest datum still within 1.5 times the interquartile range of the upper quartile. Don't get confused during the interview!
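To see this rule in action, here is a small sketch using boxplot.stats from the grDevices package, which computes the numbers that boxplot draws. The vector `x` is made up for illustration. One caveat: boxplot.stats uses hinges from fivenum as the quartiles, which can differ slightly from what quantile() returns.

```r
# Made-up data with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# coef = 1.5 is the default multiplier of the interquartile range.
bs <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bs$stats)  # 1.0 3.0 5.5 8.0 9.0

# Points beyond the whiskers are reported as outliers.
print(bs$out)    # 30
```

The upper whisker ends at 9 rather than at the true maximum of 30, because 30 lies more than 1.5 times the interquartile range (here 5) above the upper hinge of 8.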

25. Summary

To summarize, we covered the hypotheses of ANOVA, type I and II errors, the assumptions of ANOVA, the oneway.test function in R, and box plots.

26. Let's practice!

Let's move on to the exercises!