Introduction to linear regression

1. Introduction to linear regression

Regressions aid in making data-driven predictions.

2. Regression

Suppose each subject ate either a cheese pizza or a pepperoni pizza, and we measured the time taken to eat it in our A/B design. Regression analysis determines which factors impact a variable, such as time to eat the pizza. The variable we want to predict, time to eat the pizza, is the dependent variable. The independent variables are the factors suspected of impacting the dependent variable, such as enjoyment of the pizza, hunger level, or visual appeal.

3. Regression line

Given a relationship between two variables, here a positive one, we can use regression to predict the time to eat the pizza from enjoyment. We start with the regression line, or line of best fit, plotted in red. The regression line is the best linear summary of the variables’ relationship, and the value it gives for a particular X is Y-hat, the prediction of Y.

4. Regression line

The formula behind this line is Y equals beta-zero plus beta-one times X-one plus error. Beta-zero is the y-intercept, the value of Y when X-one is zero.
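
Written symbolically, the spoken formula might look like the following. The symbol epsilon for the error term and the separate expression for Y-hat are notational assumptions, not taken from the slides.

```latex
Y = \beta_0 + \beta_1 X_1 + \varepsilon, \qquad \hat{Y} = \beta_0 + \beta_1 X_1
```

The first expression describes the observed value, including the error; the second is the regression line itself, which gives the prediction.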

5. Regression line

Beta-one is the slope: for every one-unit increase in X-one, Y changes by an average of beta-one.
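
As a rough illustration of where beta-zero and beta-one come from, here is a minimal Python sketch that fits a line by ordinary least squares. The transcript does not say how the line is estimated, so least squares is an assumption (it is the usual method for a line of best fit), and the data and the function name fit_simple_regression are invented for the example.

```python
from statistics import mean

# Hypothetical data, invented for illustration only (not from the course).
enjoyment = [2, 4, 6, 8, 10]              # X-one: enjoyment rating
time_to_eat = [5.5, 5.6, 5.9, 5.9, 6.1]   # Y: time to eat the pizza


def fit_simple_regression(x, y):
    """Return (beta_zero, beta_one) for the least-squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_one = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta_zero = y_bar - beta_one * x_bar
    return beta_zero, beta_one


beta_zero, beta_one = fit_simple_regression(enjoyment, time_to_eat)
print(f"beta-zero (intercept): {beta_zero:.3f}")
print(f"beta-one (slope): {beta_one:.3f}")
```

With this made-up data the slope is positive, so each additional unit of enjoyment is associated with a slightly longer predicted time to eat.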

6. Regression line

The error term is necessary because the independent variable does not perfectly predict the dependent variable; the regression line is an estimate. The residuals, the differences between the observed and predicted values, are used to derive the error of a model. This term indicates how certain the formula is: a larger error term means less certainty in the regression line. The error term can often be reduced by including more independent variables, a method called multiple regression. Including hunger level as X-two, for instance, makes the formula Y equals beta-zero plus beta-one times X-one plus beta-two times X-two plus error.
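
To make the idea of residuals concrete, the sketch below computes them for a small invented data set and summarizes them with the residual standard error. The data and coefficients are hypothetical, and the residual standard error is only one common way to summarize model error; the transcript does not name a specific measure.

```python
from math import sqrt

# Hypothetical observations and coefficients, invented for illustration.
enjoyment = [2, 4, 6, 8, 10]
observed_time = [5.5, 5.6, 5.9, 5.9, 6.1]
beta_zero, beta_one = 5.35, 0.075

# Predicted value (Y-hat) for each observation.
predicted_time = [beta_zero + beta_one * x for x in enjoyment]

# Residual: observed value minus predicted value.
residuals = [obs - pred for obs, pred in zip(observed_time, predicted_time)]

# Sum of squared residuals, then the residual standard error.
# The divisor n - 2 accounts for the two estimated coefficients.
sse = sum(r ** 2 for r in residuals)
rse = sqrt(sse / (len(residuals) - 2))

for x, r in zip(enjoyment, residuals):
    print(f"enjoyment={x:2d}  residual={r:+.3f}")
print(f"residual standard error: {rse:.4f}")
```

Smaller residuals mean the line sits closer to the observed points, which corresponds to more certainty in the formula.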

7. Predicting data

Suppose we found a beta-zero of 5.32 and a beta-one of 0.08. We can use these values in the regression formula to predict the time to eat given an enjoyment of 15, for instance: Y-hat is 5.32 plus 0.08 times 15, or 6.52. To visualize this point, create a scatter plot with the dependent variable, time, on the y-axis and the independent variable, enjoyment, on the x-axis. This orientation reflects the direction being assessed: the impact of enjoyment on time to eat. Add a horizontal line with geom_hline, setting its y-intercept to yhat, and a vertical line with geom_vline, setting its x-intercept to the x-axis point, 15. The two lines cross at the predicted point.
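
The prediction step can be sketched in a few lines of Python. The coefficients are the ones quoted above; the function name predict_time is a hypothetical helper, and this sketch only computes the point rather than drawing the plot.

```python
# Coefficients quoted in the transcript.
beta_zero = 5.32   # y-intercept
beta_one = 0.08    # slope for enjoyment


def predict_time(enjoyment):
    """Return Y-hat, the predicted time to eat, for a given enjoyment score."""
    return beta_zero + beta_one * enjoyment


x_point = 15
yhat = predict_time(x_point)
print(f"Predicted time to eat at enjoyment {x_point}: {yhat:.2f}")
# prints: Predicted time to eat at enjoyment 15: 6.52
```

On the scatter plot, the horizontal line at yhat and the vertical line at 15 would meet at this point on the regression line.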

8. Regression considerations

Remember that correlation does not equal causation. In an A/B design, the group assignment can be inferred to affect a relationship if the correlations within each group differ. Additionally, assess only dependent variables for which a relationship is reasonable. If every possible relationship is assessed, non-meaningful relationships can appear by chance and be given more meaning than they deserve, much as a die will eventually show the same value several times in a row if it is rolled enough times. Ask what decisions will be made with the data and what actions will be taken regarding each variable. If no action can be taken on a variable, such as weather impacting sales, there is no benefit to including it in the model. Remember that bad data leads to bad analyses. Keep the error term in mind to assess how certain and reliable the formula is. While this may not matter much in every case, each unit of change in a regression-based decision could translate into large expenses, depending on the company and the variables. Analysis results should always be assessed in light of the situation at hand and the real-world implications.
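
The die analogy can be put into numbers with a short sketch. It assumes a fair six-sided die and independent sets of three rolls; the function name chance_of_at_least_one is hypothetical. The same arithmetic applies loosely to checking many unrelated relationships: the more you check, the more likely at least one looks meaningful by chance.

```python
def chance_of_at_least_one(p_single, n_tries):
    """Probability that an event with chance p_single occurs at least once in n_tries independent tries."""
    return 1 - (1 - p_single) ** n_tries


# Chance that three rolls of a fair die all show the same value:
# 6 possible values, each with probability (1/6)^3, so 1/36 overall.
p_triple = 6 * (1 / 6) ** 3

for n_sets in (1, 10, 100):
    chance = chance_of_at_least_one(p_triple, n_sets)
    print(f"{n_sets:3d} sets of three rolls: {chance:.1%} chance of at least one triple")
```

A single set of three rolls rarely produces a triple, but across a hundred sets it becomes very likely, even though nothing about the die has changed.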

9. Let's practice!

Let's apply this.