The importance of EDA: Anscombe's quartet
1. The importance of EDA: Anscombe's quartet
In 1973, statistician Francis Anscombe published a paper that contained
four fictitious x-y data sets, plotted here. He used these data sets to make an important point. That point becomes clear if we blindly go about doing parameter estimation on these data sets. First, let's look at the average x-values of the four data sets.
They are all the same. How about the average y-values?
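As a minimal sketch of this check, the snippet below hard-codes the quartet's values as they are commonly reproduced from Anscombe's 1973 paper and computes the mean of x and y for each set with the standard library only. The `quartet` dictionary, the labels I to IV, and the `means` name are choices made here for illustration, not something taken from the talk.

```python
from statistics import fmean

# Anscombe's quartet, values as commonly reproduced from the 1973 paper.
# Sets I-III share the same x-values; set IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x and mean of y for each data set (hypothetical helper name).
means = {name: (fmean(x), fmean(y)) for name, (x, y) in quartet.items()}

for name, (mean_x, mean_y) in means.items():
    print(f"{name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}")
```

Rounded to two decimals, every set should report a mean x of 9.00 and a mean y of 7.50.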
Again, all the same. And what if we do a linear regression on each of the data sets?
They all have the same line! Surely some of the fits are worse than others. Let's look at the sum of the squares of the residuals.
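One way to check the regression and the residuals by hand is sketched below. It fits a line to each set with the textbook least-squares formulas and then sums the squared residuals. The helper names `fit_line` and `sum_sq_residuals` are hypothetical, and the data values are the ones commonly reproduced from Anscombe's paper; a library routine for linear fitting would serve just as well.

```python
from statistics import fmean

# Anscombe's quartet, values as commonly reproduced from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = fmean(x), fmean(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def sum_sq_residuals(x, y, slope, intercept):
    """Sum of squared differences between observed y and the fitted line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


# Map each set name to (slope, intercept, sum of squared residuals).
fits = {}
for name, (x, y) in quartet.items():
    slope, intercept = fit_line(x, y)
    fits[name] = (slope, intercept, sum_sq_residuals(x, y, slope, intercept))
    print(f"{name:>3}: slope = {slope:.3f}, intercept = {intercept:.3f}, "
          f"sum of squared residuals = {fits[name][2]:.2f}")
```

All four sets should give a slope close to 0.5, an intercept close to 3, and a sum of squared residuals a little under 13.8.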
Oh my, they are all basically the same as well. Of course, Anscombe constructed the data sets so that this would happen. The point he was making is very important. You already have some powerful tools for statistical inference. You can compute summary statistics and optimal parameters, including linear regression parameters, and by the end of the course, you will be able to construct confidence intervals, which quantify uncertainty about the parameter estimates. These are crucial skills for any data analysis, no doubt.
7. Look before you leap!
But look before you leap! This is a powerful reminder to do some graphical exploratory data analysis before you start computing and making judgments about your data. For example,
this data set might be well modeled with a line, and the regression parameters will be meaningful. The same is true of
this data set, but the outlier throws off the slope and intercept. After doing EDA, you should look into what is causing that outlier.
This data set might also have a linear relationship between x and y, but from the plot, you can conclude that you should try to acquire more data for intermediate x values to make sure that it does.
And this data set is definitely not linear, and you need to choose another model. Explore your data first.
12. Let's practice!
I'll let you prove to yourself that these data sets give the same regression parameters. It will be good practice, and seeing is believing!
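To see the four data sets for yourself, a plotting sketch along the following lines may help. It assumes matplotlib is installed, uses the quartet values as commonly reproduced from Anscombe's paper, and draws each set in its own panel with shared axes. The function name `plot_quartet` and the panel titles are hypothetical choices made for this example.

```python
import matplotlib.pyplot as plt

# Anscombe's quartet, values as commonly reproduced from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def plot_quartet(data):
    """Scatter-plot each data set in a 2x2 grid with shared axes."""
    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
    for ax, (name, (x, y)) in zip(axes.flat, data.items()):
        # Markers only, no connecting line, so the shape of the data shows.
        ax.plot(x, y, marker="o", linestyle="none")
        ax.set_title(f"Data set {name}")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    return fig
```

Calling `plot_quartet(quartet)` and then `plt.show()` displays the figure. Sharing the axes across panels is a deliberate choice here: it makes the differences in shape easy to compare at a glance.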