Linear regression

1. Using linear regression

2. Accounting for other factors

You've seen that the difference in salary between new hires and current employees seemed to disappear when another variable was taken into account. Looking at one variable alone is usually not enough to be confident that a meaningful difference exists. It would help if there were a way to test the significance of the difference between two groups while taking one or more other factors into account, including variables that were omitted from the original comparison.

3. Linear regression

Luckily, there is a way. Linear regression is a powerful statistical tool that allows you to test whether one group is meaningfully different than another in a specific way, while accounting for other factors. Linear regression can also be used in forecasting, optimization problems, and measuring the impact of one variable on another. For this course, we'll focus on using regression to test the differences between groups.

4. Simple linear regression

Here is the pay data, split up by new hires and current employees. A linear regression will find an equation for the lines that best fit each set of data. Conceptually, the best fit is the line that is closer to the points than any other line. More precisely, it is the line that minimizes the sum of the squared vertical distances between the line and the points.

5. Simple linear regression

In this example, the best fitting lines, shown in blue, are the mean salary for each group.
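To see why the best fitting lines are the group means, here is a minimal sketch in base R. The data frame is a simulated stand-in for the course's pay data, so every number in it is made up, and the column name salary is an assumption (new_hire is inferred from the new_hireYes term).

```r
# Simulated stand-in for the course's pay data (all values are made up).
set.seed(1)
pay <- data.frame(
  new_hire = rep(c("No", "Yes"), times = c(60, 40)),
  salary   = c(rnorm(60, mean = 73000, sd = 8000),
               rnorm(40, mean = 76000, sd = 8000))
)

# Mean salary within each group, computed directly.
group_means <- tapply(pay$salary, pay$new_hire, mean)

# Fit salary as a function of group membership.
model <- lm(salary ~ new_hire, data = pay)

# The intercept is the mean of the reference group ("No");
# intercept + new_hireYes is the mean of the "Yes" group.
fitted_means <- c(
  No  = unname(coef(model)["(Intercept)"]),
  Yes = unname(coef(model)["(Intercept)"] + coef(model)["new_hireYes"])
)

print(group_means)
print(fitted_means)
```

The two printed vectors should agree up to floating-point rounding, which is the sense in which the fitted "lines" are the group means.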

6. Simple linear regression

You can build a linear model using the lm() function. The variable you're comparing goes to the left of the tilde, and the grouping variable goes on the right. Just as with a t-test, you specify that the dataset is pay, and you can use tidy() to see a clean version of the results. The first two columns hold the terms and the estimates, also known as coefficients. Here, the two terms are for new hires, which is the "new_hireYes" row, and for current employees, who are the reference group and appear as the Intercept row. To interpret this output, you could say that you'd expect current employees to have an average salary of $73,424, and new hires to have an average salary that is $2,649 higher, or $76,073. This is what you saw on the graph, and you can see that it matches the average salaries.
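As a rough illustration of that call, the sketch below fits the model and tidies it. It assumes the broom package is installed (it provides tidy()), and it uses a simulated stand-in for pay with an assumed salary column, so the estimates will not match the course's figures.

```r
library(broom)  # provides tidy()

# Simulated stand-in for the course's pay data (all values are made up).
set.seed(1)
pay <- data.frame(
  new_hire = rep(c("No", "Yes"), times = c(60, 40)),
  salary   = c(rnorm(60, mean = 73000, sd = 8000),
               rnorm(40, mean = 76000, sd = 8000))
)

# Response on the left of the tilde, grouping variable on the right.
model <- lm(salary ~ new_hire, data = pay)

# One row per term: (Intercept) for the reference group ("No"),
# new_hireYes for the difference between "Yes" and "No".
results <- tidy(model)
print(results)
```

Because "No" sorts before "Yes", R treats current employees as the reference level, which is why they show up in the intercept row.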

7. Significance for linear regression

In this course, we will focus on interpreting only the estimate and the p-value columns. The p-value for new_hireYes is 0.017, which is less than 0.05. That means the average salary of new hires is significantly different than the average salary of current employees. Which group has the higher average salary? Since the estimate for new_hireYes is positive, the result of this regression is that new hires have a significantly higher salary than current employees at the 0.05 level.
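If you want to make that check in code rather than by eye, one possible sketch is shown here. It again assumes broom is installed and uses simulated data with an assumed salary column, so the p-value it prints is not the 0.017 from the course output.

```r
library(broom)  # provides tidy()

# Simulated stand-in for the course's pay data (all values are made up).
set.seed(1)
pay <- data.frame(
  new_hire = rep(c("No", "Yes"), times = c(60, 40)),
  salary   = c(rnorm(60, mean = 73000, sd = 8000),
               rnorm(40, mean = 76000, sd = 8000))
)

results <- tidy(lm(salary ~ new_hire, data = pay))

# Keep only the row for the new-hire term.
new_hire_row <- results[results$term == "new_hireYes", ]

# Significance: is the p-value below the 0.05 threshold?
is_significant <- new_hire_row$p.value < 0.05

# Direction: the sign of the estimate says which group is higher.
direction <- if (new_hire_row$estimate > 0) "higher" else "lower"

cat("p-value:", new_hire_row$p.value, "\n")
cat("Significant at 0.05:", is_significant, "\n")
cat("New hires have the", direction, "average salary\n")
```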

8. Multiple linear regression

To test whether that result is still significant when accounting for the differences in the departments the employees are in, you can use multiple linear regression. The syntax is identical to simple linear regression, but you use the plus sign to add additional variables, such as department. The output includes two more rows than the simple regression, one for each department other than the reference department. You can't interpret the estimates exactly the same way anymore, because the intercept now describes current employees in the reference department. But since the p-value for new_hireYes is 0.019, which is less than 0.05, and the estimate for new_hireYes is positive, the result of the multiple regression is that new hires do earn more than current employees, even when department is taken into account.
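A minimal sketch of the multiple regression follows. The data are simulated, and the salary column name, the three department names, and the effect sizes are all hypothetical. Three departments are used only so that the output gains two rows, as described. It assumes broom is installed.

```r
library(broom)  # provides tidy()

# Simulated stand-in for the course's pay data (all values are made up).
set.seed(1)
n <- 120
pay <- data.frame(
  new_hire   = sample(c("No", "Yes"), n, replace = TRUE),
  department = sample(c("Engineering", "Finance", "Sales"), n, replace = TRUE)
)

# Hypothetical salary offsets per department.
dept_effect <- c(Engineering = 6000, Finance = 0, Sales = -4000)
pay$salary <- 73000 +
  2500 * (pay$new_hire == "Yes") +
  unname(dept_effect[pay$department]) +
  rnorm(n, mean = 0, sd = 8000)

# Same syntax as before, with + to add department as a second variable.
model_multi <- lm(salary ~ new_hire + department, data = pay)

results_multi <- tidy(model_multi)
print(results_multi)
```

The new_hireYes row now estimates the difference between new hires and current employees within the same department, which is what "taking department into account" means here.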

9. Using summary()

As a final note, you can replace tidy() with summary() to get a more comprehensive output. One advantage is that the summary() function adds significance stars next to the p-values, which are in the far-right column. If the p-value has at least one star next to it, as it does here, the result is significant at the 0.05 level.
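For comparison, this is roughly what the summary() version looks like in base R, again on simulated data with an assumed salary column. Whether stars appear depends on the simulated p-value. Note that a period next to a p-value marks the 0.1 level and does not count as a star.

```r
# Simulated stand-in for the course's pay data (all values are made up).
set.seed(1)
pay <- data.frame(
  new_hire = rep(c("No", "Yes"), times = c(60, 40)),
  salary   = c(rnorm(60, mean = 73000, sd = 8000),
               rnorm(40, mean = 76000, sd = 8000))
)

model <- lm(salary ~ new_hire, data = pay)

# Full printed report: coefficients with significance codes,
# residual standard error, R-squared, and the overall F-test.
model_summary <- summary(model)
print(model_summary)

# The coefficient table is also available as a matrix;
# the p-values are in the last column, "Pr(>|t|)".
coef_table <- coef(model_summary)
p_new_hire <- coef_table["new_hireYes", "Pr(>|t|)"]
print(p_new_hire)
```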

10. Let's practice!

Time to tackle the next exercises using linear regression.