1. Using logistic regression
Linear regression works well when the dependent variable, that is, the variable you're interested in comparing, is continuous, such as base salary. The power of regression comes from its ability to fit a straight line to the data points with as little error as possible.
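As a minimal sketch of what that looks like in R, assuming a hypothetical data frame `pay` with columns `salary` and `years_experience`:

```r
# Fit a straight line to a continuous outcome (base salary)
# using ordinary linear regression with lm().
# The data frame and variable names here are hypothetical.
fit <- lm(salary ~ years_experience, data = pay)

# The coefficients describe the best-fitting straight line.
summary(fit)
```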
2. Binary dependent variable
A linear regression model is less effective when the dependent variable is limited to two values. Variables that can take only two values are called binary variables. Many variables of this type are yes/no variables, such as the one in this graph. Employees who are high performers are on the top line, and employees who are not are on the lower line. In this example, salary is the independent variable along the x-axis.
3. Linear fit on binary outcome
A linear model will try to draw a straight line through the points as well as it can. You can see that most of the points are far from the red line, which means the model isn't appropriate for this kind of data. The model is an especially poor fit at high and low salaries.
4. Linear vs logistic fits
Logistic regression is better suited to data with a binary dependent variable. You can see that the blue line, the logistic fit, curves appropriately to be a much closer fit to the data, even at the extreme salary levels. In this chapter, the dependent variable you're analyzing is high_performer, a binary variable. You'll need to use logistic regression to check whether the difference in high-performance ratings between men and women in this dataset can be explained by an omitted variable.
5. Using glm() for logistic regression
The syntax for logistic regression is a little different from the syntax for linear regression. First, you use the glm() function instead of lm(). GLM stands for generalized linear model, and glm() can be used to fit a wide variety of models. To specify the type of regression model, use the "family" argument. For a logistic regression, use family = "binomial".
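Putting those pieces together, a simple logistic regression call looks like the sketch below; the data frame name `pay` is an assumption for illustration:

```r
# Logistic regression: glm() with family = "binomial".
# high_performer is the binary dependent variable,
# salary the independent variable; the data frame name is hypothetical.
model <- glm(high_performer ~ salary, data = pay, family = "binomial")

# Inspect the estimates and p-values.
summary(model)
```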
6. Multiple logistic regression
Multiple logistic regression works much like multiple linear regression. You can take additional factors into account by appending them with the plus sign. And, just like before, you can check the p-value of the variable you're comparing to see if the result of the test is significant.
We won't review how to interpret the estimates or coefficients from a logistic regression in this course, but there are two things you do need to know.
First, logistic regression coefficients do not have the same meaning as the coefficients produced by a linear regression.
Second, you can interpret the sign of the coefficients -- that is, whether they are positive or negative -- the way you expect. For example, if one of the variables is significant, as salary is here, you can say that employees with higher salaries are significantly more likely to be high performers because the estimate on salary is positive. A negative sign would be interpreted to mean that employees with higher salaries are significantly less likely to be high performers.
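The multiple-regression version is sketched below, again with hypothetical variable and data frame names; additional predictors are appended with the plus sign, and the p-value column of summary() shows which are significant:

```r
# Multiple logistic regression: add further predictors with "+".
# Variable and data frame names here are hypothetical.
model <- glm(high_performer ~ gender + salary + department,
             data = pay, family = "binomial")

# In the coefficient table, check the p-value column (Pr(>|z|))
# for significance, and the sign of each estimate: a positive,
# significant estimate on salary means higher-paid employees are
# significantly more likely to be high performers.
summary(model)
```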
7. Let's practice!
Now it's time to fit a logistic regression model on the performance data.