
The importance of EDA: Anscombe's quartet

1. The importance of EDA: Anscombe's quartet

In 1973, statistician Francis Anscombe published a paper that contained

2. Anscombe's quartet

four fictitious x-y data sets, plotted here. He used these data sets to make an important point. That point becomes clear if we blindly go about doing parameter estimation on these data sets. First, let's look at the average x-values of the four data sets.

3. Anscombe's quartet

They are all the same. How about the average y-values?

4. Anscombe's quartet

Again, all the same. And what if we do a linear regression on each of the data sets?

5. Anscombe's quartet

They all have the same line! Surely some of the fits are less optimal than others. Let's look at the sum of the squares of the residuals.

6. Anscombe's quartet

Oh my, they are all basically the same as well. Of course, Anscombe constructed the data sets so that this would happen. The point he was making is very important. You already have some powerful tools for statistical inference. You can compute summary statistics and optimal parameters, including linear regression parameters, and by the end of the course, you will be able to construct confidence intervals that quantify uncertainty about the parameter estimates. These are crucial skills for any data analysis, no doubt.
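The numerical checks walked through above can be sketched in a few lines of NumPy. This is a minimal sketch, not the course's own code: the data values are transcribed from the commonly reproduced table in Anscombe's 1973 paper (an assumption worth checking against your own copy), and the names `quartet` and `summarize` are hypothetical.

```python
import numpy as np

# Anscombe's quartet, transcribed from the commonly reproduced 1973 table.
# Assumption: verify these values against your own copy of the data.
x_123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = {
    "I": (x_123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                           7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x_123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                            6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x_123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                             6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
           np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                     5.25, 12.50, 5.56, 7.91, 6.89])),
}


def summarize(x, y):
    """Return mean x, mean y, slope, intercept, and sum of squared residuals."""
    # A degree-1 polynomial fit is an ordinary least-squares line
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return np.mean(x), np.mean(y), slope, intercept, np.sum(residuals**2)


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}

for name, (mean_x, mean_y, slope, intercept, rss) in summaries.items():
    print(f"{name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, "
          f"slope = {slope:.3f}, intercept = {intercept:.3f}, RSS = {rss:.2f}")
```

Every row of the printout should look nearly identical, which is exactly why the numbers alone cannot tell the four data sets apart.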

7. Look before you leap!

But look before you leap! This is a powerful reminder to do some graphical exploratory data analysis before you start computing and making judgments about your data. For example,

8. Anscombe's quartet

this data set might be well modeled with a line, and the regression parameters will be meaningful. The same is true of

9. Anscombe's quartet

this data set, but the outlier throws off the slope and intercept. After doing EDA, you should look into what is causing that outlier.
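To see how much a single point can move the fit, here is a minimal sketch that refits the line with the most extreme point left out. It assumes the data set with the outlier is the third of Anscombe's four, with values transcribed from the commonly reproduced table. Dropping the point is for illustration only; in a real analysis you would first investigate what caused it.

```python
import numpy as np

# Assumed to be Anscombe's third data set (values transcribed from the
# commonly reproduced table; verify against your own copy).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
              6.08, 5.39, 8.15, 6.42, 5.73])

# Fit using every point, outlier included
slope_all, intercept_all = np.polyfit(x, y, 1)

# Flag the point with the largest absolute residual from that fit
residuals = y - (slope_all * x + intercept_all)
outlier = int(np.argmax(np.abs(residuals)))

# Refit without the flagged point (illustration only, not a recommendation)
keep = np.arange(len(x)) != outlier
slope_trim, intercept_trim = np.polyfit(x[keep], y[keep], 1)

print(f"all points:      slope = {slope_all:.3f}, intercept = {intercept_all:.3f}")
print(f"without outlier: slope = {slope_trim:.3f}, intercept = {intercept_trim:.3f}")
print(f"flagged point:   x = {x[outlier]}, y = {y[outlier]}")
```

The two fits differ noticeably in both slope and intercept, even though only one of eleven points changed.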

10. Anscombe's quartet

This data set might also have a linear relationship between x and y, but from the plot, you can conclude that you should try to acquire more data for intermediate x values to make sure that it does.

11. Anscombe's quartet

And this data set is definitely not linear, and you need to choose another model. Explore your data first. I'll let you prove to yourself

12. Let's practice!

that these data sets give the same regression parameters. It will be good practice, and seeing is believing!
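For the graphical side, a minimal sketch with Matplotlib is shown here: one panel per data set, each with its least-squares line overlaid. The data values are transcribed from the commonly reproduced table in Anscombe's paper (an assumption to verify), and `plot_quartet` is a hypothetical helper, not something the course provides.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet, transcribed from the commonly reproduced 1973 table.
# Assumption: verify these values against your own copy of the data.
x_123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = {
    "I": (x_123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                           7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x_123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                            6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x_123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                             6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
           np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                     5.25, 12.50, 5.56, 7.91, 6.89])),
}


def plot_quartet(data):
    """Scatter each data set in its own panel with its fitted line."""
    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
    x_line = np.array([3.0, 20.0])  # endpoints for drawing the fitted line
    for ax, (name, (x, y)) in zip(axes.flat, data.items()):
        slope, intercept = np.polyfit(x, y, 1)
        ax.plot(x, y, marker="o", linestyle="none")
        ax.plot(x_line, slope * x_line + intercept)
        ax.set_title(f"Data set {name}")
    for ax in axes[-1, :]:
        ax.set_xlabel("x")
    for ax in axes[:, 0]:
        ax.set_ylabel("y")
    fig.tight_layout()
    return fig, axes


fig, axes = plot_quartet(quartet)
# In an interactive session, call plt.show() to display the figure,
# or fig.savefig("anscombe.png") to write it to disk.
```

Sharing the axes across panels keeps the four fitted lines visually comparable, so the differences you see come from the points themselves.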
