1. Explaining teaching score with gender
Let's now expand your modeling toolbox with basic regression models where the explanatory/predictor variable is not numerical, but rather categorical. Much of the world's data is categorical in nature, and it's important to be equipped to handle it. You'll continue constructing explanatory and predictive models of teaching score, now using the variable gender, which at the time of this study was recorded as a binary categorical variable.
2. Exploratory data visualization
Just as we did in Chapter 1 for house price and house condition, let's construct an exploratory boxplot of the relationship between score and gender.
3. Boxplot of score over gender
You can easily compare the distributions of scores for men and women using the single horizontal lines marking each median. For example, it seems male instructors tended to get higher scores, as evidenced by the higher median. Remember, the solid line in a boxplot marks the median, not the mean.
So before performing any formal regression modeling, I expect men will tend to be rated higher than women by students. Let's make a mental note: treating the women's median of about 4.1 as a "baseline for comparison", I observe a difference of about +0.2 for men. These aren't exact values, just a rough eyeballing.
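The course's plots are built in R, but this "baseline plus difference" reading can be sketched as a computation in any language. Here is a minimal Python sketch on made-up scores (not the real evals data) that computes each group's median and the gap between them:

```python
# Rough eyeballing of a boxplot comparison, as a computation: treat one
# group's median as the baseline and the other group's median as
# baseline plus a difference. The scores below are made up for illustration.
import statistics

scores = {
    "female": [3.9, 4.0, 4.1, 4.2, 4.3],  # hypothetical sample
    "male":   [4.1, 4.2, 4.3, 4.4, 4.5],  # hypothetical sample
}

medians = {group: statistics.median(vals) for group, vals in scores.items()}
baseline = medians["female"]             # "baseline for comparison"
difference = medians["male"] - baseline  # offset for the other group
```

With these toy numbers the baseline is 4.1 and the difference is +0.2, mirroring the rough read of the boxplot above.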
4. Faceted histogram
An alternative exploratory visualization is a faceted histogram. You use geom_histogram(), where x maps to the numerical variable score, and where we now have facets split by gender.
5. Faceted histogram
Unlike the boxplots, you now get a sense for the shape of the distributions. They both exhibit a slight left-skew. Nothing drastic like the right-skew of Seattle house prices, but still a slight skew nonetheless.
However, it's now harder to say which distribution is centered at a higher point. This is because the median isn't clearly marked like in boxplots. Furthermore, comparisons between groups can't be made using single lines.
So which plot is better? Boxplots or faceted histograms? There is no universal right answer; it all depends on what you are trying to emphasize to the consumers of these visualizations.
6. Fitting a regression model
You fit the regression as before, where the model formula y tilde x in this case has x set to gender. Using get_regression_table(), you see the regression table again yields a fitted intercept and a fitted slope, but what do these mean when the explanatory variable is categorical?
The intercept 4.09 is the average score for the women, the "baseline for comparison" group. Why in this case is the baseline group female and not male? For no other reason than that "female" comes alphabetically before "male".
Behind the scenes, gender gets encoded as a "dummy" or "indicator" variable that equals 1 for men and 0 for women. The slope of 0.142 on this dummy variable is the difference in average score for men relative to women.
It's not that men had an average score of 0.142; rather, they differed on average from the women by +0.142, so their average score is the sum 4.09 + 0.142 = 4.23.
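This arithmetic is exactly what ordinary least squares produces once the categorical variable is encoded as a 0/1 dummy. The course fits the model in R; as a language-agnostic sketch on made-up scores, here is the same idea in Python, where the fitted intercept recovers the baseline group's mean and the fitted slope recovers the difference in group means:

```python
# Regression with a binary categorical predictor, sketched in Python on
# made-up scores: gender is encoded as a 0/1 dummy variable, and ordinary
# least squares then recovers intercept = baseline-group mean and
# slope = difference in group means.
import numpy as np

female_scores = np.array([4.0, 4.1, 4.2])  # hypothetical values
male_scores = np.array([4.2, 4.3, 4.4])    # hypothetical values

y = np.concatenate([female_scores, male_scores])
is_male = np.array([0, 0, 0, 1, 1, 1])     # dummy variable: 1 = male
X = np.column_stack([np.ones_like(y), is_male])  # intercept column + dummy

intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]
```

On this toy data, intercept comes out as the female mean and slope as the male-minus-female difference, just as the regression table reports for the real evals data.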
7. Fitting a regression model
Let's convince ourselves of this by computing group means using the group_by() and summarize() verbs.
So, the latter table shows the means for men and women separately, whereas the regression table shows the average teaching score for the baseline group of women, and the relative difference to this baseline for the men.
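The course performs this check with dplyr's group_by() and summarize(); a rough Python analogue on made-up scores (not the real evals data) uses a pandas groupby, and the group means line up with the regression table as mean(female) = intercept and mean(male) = intercept + slope:

```python
# pandas analogue of group_by(gender) + summarize(mean_score = mean(score)),
# on made-up scores: the group means and the regression table carry the
# same information, just parameterized differently.
import pandas as pd

toy = pd.DataFrame({
    "gender": ["female"] * 3 + ["male"] * 3,
    "score": [4.0, 4.1, 4.2, 4.2, 4.3, 4.4],  # hypothetical values
})

group_means = toy.groupby("gender")["score"].mean()
intercept = group_means["female"]                    # baseline group's mean
slope = group_means["male"] - group_means["female"]  # relative difference
```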
8. A different categorical explanatory variable: rank
Let's now consider a different categorical variable: rank. Let's group the evals data by rank and obtain counts using the n() function in the summarize call.
Observe three levels. First, teaching instructors' responsibilities lie primarily in teaching courses. Second, tenure track faculty also have research expectations and, generally speaking, are on the path to being promoted to the third and highest rank, tenured.
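The group_by(rank) plus summarize(n = n()) pattern tallies observations per level; here is a rough pandas analogue. The level names match the evals data, but the counts below are made up for illustration:

```python
# Counting observations per level of rank, a pandas analogue of
# group_by(rank) + summarize(n = n()). The level names match the evals
# data; the counts themselves are made up.
import pandas as pd

ranks = pd.Series(
    ["teaching"] * 2 + ["tenure track"] * 3 + ["tenured"] * 4
)
counts = ranks.value_counts()
```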
9. Let's practice!
Your turn! Instead of modeling teaching score as a function of the categorical variable gender, which has two levels, let's use the categorical variable rank, which has three levels.
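For a categorical variable with three levels, the regression uses k − 1 = 2 dummy variables, and the alphabetically first level becomes the baseline, just as "female" did for gender. Here "teaching" sorts first, so it is the baseline. A Python sketch of this encoding with pandas (the scores are hypothetical):

```python
# Dummy coding for a 3-level categorical variable, sketched with pandas:
# k - 1 = 2 indicator columns are created, and the alphabetically first
# level ("teaching") is dropped as the baseline. Scores are hypothetical.
import pandas as pd

toy = pd.DataFrame({
    "rank": ["teaching", "tenure track", "tenured", "teaching"],
    "score": [4.2, 4.0, 4.1, 4.4],
})

dummies = pd.get_dummies(toy["rank"], drop_first=True)
# remaining indicator columns: "tenure track" and "tenured"
```

The fitted intercept would then be the average score for teaching instructors, and each slope the difference in average score for that rank relative to teaching.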