1. ANOVA, single, and multiple factor experiments
Now, let's explore a few key concepts in experimental design: the ANOVA test for three or more groups, single and multiple factor experiments, and the importance of completely randomized design. I'll also introduce the open dataset we'll be using for this chapter, which comes from Lending Club, a loan company.
2. ANOVA
So far, we've done some basic comparative experiments where we examined the difference in means between two groups (or two time periods of the same group) using a t-test. What do we do if we have more than two groups to compare?
The answer is an ANOVA test, which stands for Analysis of Variance. It allows us to compare the means of three or more groups, though there's a bit of a catch: we'll only know whether at least one of the means is different from the others, but not which mean specifically.
The test is still informative and can be implemented in R in a few different ways. In the first, you build a model object with the lm() or glm() functions and then call anova() on that model object. The second implementation, the aov() function, calls lm() internally and both builds the model object for you and conducts the ANOVA test.
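To make the two routes concrete, here's a minimal sketch. It assumes the built-in PlantGrowth dataset (three groups of plant weights) purely for illustration; it isn't the data used in this chapter, and the object names are made up for the example.

```r
# PlantGrowth ships with base R: plant weight for a control and two treatment groups
# (an illustrative stand-in, not the chapter's dataset)
data(PlantGrowth)

# Route 1: build a model object with lm(), then call anova() on it
model_lm <- lm(weight ~ group, data = PlantGrowth)
anova_lm <- anova(model_lm)
print(anova_lm)

# Route 2: aov() builds the model and conducts the ANOVA test in one step
model_aov <- aov(weight ~ group, data = PlantGrowth)
anova_aov <- summary(model_aov)
print(anova_aov)

# Pull the F statistic for the group term out of each table
f_lm <- anova_lm[["F value"]][1]
f_aov <- anova_aov[[1]][["F value"]][1]
```

Both tables should report the same F statistic for the group term, since the underlying model is the same either way.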
3. Single factor experiments
A single factor experiment is, like the model example from the last slide, an experiment with one possible explanatory variable. In this example, model 1 is a linear regression model with some outcome variable y and explanatory factor variable x (a single factor).
Ideally, a single factor experiment also has a completely randomized design, which means there's some structure in your experiment: if applicable, subjects are randomly assigned to the treatment or control group. A classic textbook example of a single factor completely randomized design is testing cotton fabric strength. You test the tensile strength of different cotton fabrics in a random order, so that all that differs between runs is the percent cotton in the fabric.
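Here's a rough sketch of what the cotton example could look like in R. The cotton percentages, the number of specimens, and the strength values are all hypothetical: the response is simulated only so the model call runs, and none of these numbers come from a real experiment.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical design: 3 cotton percentages, 4 fabric specimens each
cotton_pct <- factor(rep(c(15, 25, 35), each = 4))

# Completely randomized design: shuffle the order in which specimens are tested
design <- data.frame(
  run_order = sample(length(cotton_pct)),
  cotton_pct = cotton_pct
)
design <- design[order(design$run_order), ]

# Placeholder response, simulated only so the model below has something to fit
design$strength <- rnorm(nrow(design), mean = 10, sd = 2)

# Single factor model: one outcome, one explanatory factor variable
model_1 <- aov(strength ~ cotton_pct, data = design)
print(summary(model_1))
```

The key step is sample(), which randomizes the testing order so that nothing systematic lines up with the cotton percentage.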
4. Multiple factor experiments
A multiple factor experiment expands on the single factor experiment and includes multiple possible explanatory factor variables that may be influencing the outcome variable. This might be an experiment that takes into account not just how much Vitamin C a guinea pig was given, but also the delivery method (a callback to our tooth growth example from chapter 1).
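As a sketch of a multiple factor model, the example below assumes the tooth growth example refers to R's built-in ToothGrowth dataset, with len as the outcome, dose as the amount of Vitamin C, and supp as the delivery method. The object names tooth and model_2 are made up for illustration.

```r
# Work on a copy of the built-in dataset so the original stays untouched
tooth <- ToothGrowth

# dose is stored as a number; convert it to a factor so it is treated as groups
tooth$dose <- factor(tooth$dose)

# Two explanatory factor variables: dose and delivery method (supp)
model_2 <- aov(len ~ dose + supp, data = tooth)
tab <- summary(model_2)[[1]]
print(tab)
```

Each factor gets its own row in the ANOVA table, so you can see whether dose, delivery method, or both appear to be related to tooth length.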
5. Intro to Lending Club data
Throughout this chapter, we'll use an open dataset from the loan company Lending Club, as downloaded from Kaggle. It's a fairly large dataset with about 890,000 observations and 75 variables, so in the exercises you'll often work with a subset of it. The outcome we'll be most interested in is the amount of the loan funded. We'll test different explanatory variables that may influence that amount, and we'll also analyze an A/B test using this data.
6. Let's practice!
Let's do some exploratory data analysis on the Lending Club data and begin to explore some single and multiple factor experiments.