1. Omitted variable bias
You've learned how to test whether the difference between two groups of employees is significant. A key assumption when comparing groups is that the groups are the same except for the variable you're testing. Why is that an important assumption?
2. When group compositions differ
To use a non-HR example, suppose you measure the weight gain of two groups of people over several months. One group eats almost no meat, but is gaining weight very quickly. In fact, the two groups differ significantly in both weight gain and meat consumption. Should people who want to gain weight quickly stop eating meat?
3. When group compositions differ
In this case, I wouldn't make any recommendation, because the group that is gaining weight so rapidly is actually made up of infants, who grow much faster than the adults in the other group and aren't able to eat much meat, if any. Any conclusion you might want to draw from comparing these two groups is made useless by the difference in the groups' age composition.
4. Omitted variable bias
When a missing variable is correlated with both the dependent variable (the variable you're comparing the groups on) and the way the groups are divided, you're dealing with omitted variable bias. In this case, an important variable, age, was correlated with both weight gain rate (the dependent variable) and meat consumption (the way the groups were divided).
Put another way, weight gain rate and meat consumption looked connected only because both were tied to a missing variable: the age of the individuals in each group. That missing variable is also known as a confounding variable.
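To make the idea concrete, here is a minimal sketch in base R using simulated data. The numbers, the `people` data frame, and its column names are all hypothetical and chosen only to mimic the infant and adult example; they are not from the course data.

```r
# Hypothetical simulated data: illustrative values only.
set.seed(42)
n <- 200
is_infant <- rep(c(TRUE, FALSE), each = n)

# Infants eat almost no meat and gain weight quickly; adults do the opposite.
# Within each age group, weight gain is generated independently of meat.
meat <- ifelse(is_infant, runif(2 * n, 0, 0.1), runif(2 * n, 0.5, 2))
gain <- ifelse(is_infant,
               rnorm(2 * n, mean = 6, sd = 1),
               rnorm(2 * n, mean = 0.5, sd = 1))
people <- data.frame(is_infant, meat, gain)

# Ignoring age: meat consumption and weight gain look strongly related.
cor_overall <- cor(people$meat, people$gain)

# Looking at adults only: the apparent relationship largely disappears.
adults <- people[!people$is_infant, ]
cor_adults <- cor(adults$meat, adults$gain)

cor_overall
cor_adults
```

In this sketch the overall correlation is driven entirely by age, because the simulation builds in no direct effect of meat on weight gain.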
5. Visualizing group composition
One way to check whether two groups are similar in composition is to graph them. The specific graph you'll use depends on whether you're comparing a continuous variable, such as age, or a discrete variable, such as location or department.
6. 100% stacked bar charts
In this chapter, you'll be focusing on the composition of employee groups based on discrete variables. A good graph type to use is the 100% stacked bar chart. It looks like a regular stacked bar chart, but every bar is scaled to the same height, so the y-axis shows proportions rather than counts, even though there are many more current employees than new hires. The 100% stacked bar chart is perfect when you are more interested in the group composition than the actual number of employees in each group.
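The same comparison can be sketched numerically before plotting. The example below assumes a hypothetical `employees` data frame with one row per employee and made-up `status` and `department` columns; it uses base R's `table()` and `prop.table()` to turn counts into within-group proportions.

```r
# Hypothetical data: one row per employee; values are illustrative.
employees <- data.frame(
  status = c(rep("Current", 8), rep("New hire", 4)),
  department = c("Sales", "Sales", "Sales", "Finance", "Finance",
                 "Engineering", "Engineering", "Engineering",
                 "Sales", "Sales", "Sales", "Finance")
)

# Raw counts differ in size between the groups.
counts <- table(employees$status, employees$department)

# margin = 1 divides each row by its row total, so every group sums to 1.
composition <- prop.table(counts, margin = 1)

counts
composition
```

Each row of `composition` corresponds to one bar in a 100% stacked bar chart.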
7. 100% stacked bar charts
The syntax is similar to other bar charts you've made in this course. There are two differences. First, this plot uses geom_bar() instead of geom_col(). With geom_bar(), the y-axis is simply the count of the observations, and is not defined by an aesthetic. You can see that for this code, there is no y variable defined inside aes().
The second difference is the use of position = "fill". You've used the position argument before to create side-by-side bar charts, using "dodge". With "fill", ggplot scales the bars to be the same height, creating 100% stacked bar charts.
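The slide's code is not reproduced in this transcript, so the following is only a sketch of what such a plot might look like. The `employees` data frame and its `status` and `department` columns are hypothetical stand-ins for the course data.

```r
library(ggplot2)

# Hypothetical data: one row per employee; values are illustrative.
employees <- data.frame(
  status = c(rep("Current", 8), rep("New hire", 4)),
  department = c("Sales", "Sales", "Sales", "Finance", "Finance",
                 "Engineering", "Engineering", "Engineering",
                 "Sales", "Sales", "Sales", "Finance")
)

# No y aesthetic: geom_bar() counts the rows in each group itself.
# position = "fill" rescales every bar to the same total height.
p <- ggplot(employees, aes(x = status, fill = department)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of employees")

p
```

Swapping `position = "fill"` for `position = "dodge"` would give the side-by-side version, which emphasizes counts rather than composition.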
8. Let's practice!
Now check whether there are any omitted variables to worry about in the pay analysis.