Simple linear regression

1. Simple linear regression

Now that we have inspected the correlations between the various variables, we'll move on to predicting the future margin with the help of the margin in year 1. We chose the margin in year 1 because the correlation between these two variables is the highest. When we use only one independent variable for the prediction, the model is called a simple linear regression.

2. Perfect correlation vs. data cloud I

In reality, the ideal case of a perfect linear correlation, where you can exactly predict y from a given value of x, is very unlikely. Most of the time the data points are scattered around in the form of a cloud. In that case, we determine the direction of the relationship between x and y by fitting a straight line through the cloud.

3. Perfect correlation vs. data cloud II

This is what we use the *least squares estimation* procedure for. This method helps us find the *regression line* and returns its coefficients. The difference between our prediction (a point on the line) and the actual value (a data point) is called the prediction error or *residual* value. That's enough theory, let's move on to some code.
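To make the idea concrete (this notation is an addition and does not appear in the original narration), write the fitted regression line as $\hat{y}_i = b_0 + b_1 x_i$, so that the residual for observation $i$ is $e_i = y_i - \hat{y}_i$. Least squares estimation then chooses the intercept $b_0$ and slope $b_1$ that minimize the sum of squared residuals:

$$\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2$$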

4. Model specification

We can specify the linear regression model using a formula object in the `lm` function from the *stats* package. Looking at the arguments, notice that we predict `futureMargin` as a function of `margin`, using `clvData1`. We store the model as `simpleLM`. Then we can call the `summary` function with `simpleLM` as an argument to get an overview of the results. Take a look at the coefficient estimate for `margin`. With a value of roughly `0.65`, it is greater than 0, which means that the higher the margin in year 1, the higher we expect the future margin to be. Also, take a look at the multiple $R^2$ at the bottom of the output. A value of roughly `0.32` means that about 32 percent of the variation in the future margin can be explained by the margin in year 1. But more on that later.
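A minimal sketch of the call described above might look like this; the column names `margin` and `futureMargin` and the data frame `clvData1` are taken from the transcript, so adjust them to your own data.

```r
# Fit a simple linear regression of future margin on margin in year 1
simpleLM <- lm(futureMargin ~ margin, data = clvData1)

# Overview of coefficient estimates, their significance, and multiple R-squared
summary(simpleLM)
```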

5. Visualization using ggplot2

The `ggplot` function from the *ggplot2* package gives us a nice visualization of the relationship. Here, we produce a simple scatter plot of the observations in our `clvData1` dataset. The data is the first argument, and we map `margin` to the x-axis and `futureMargin` to the y-axis in the `aes()` call, which is the second argument to `ggplot`. We also use `geom_smooth` with `method = "lm"` to fit a linear regression line through the data cloud.
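Assuming the same `clvData1` data frame, the plot described above could be produced roughly as follows (a sketch, not the exact course code):

```r
library(ggplot2)

# Scatter plot of margin in year 1 vs. future margin,
# with a fitted linear regression line overlaid
ggplot(clvData1, aes(x = margin, y = futureMargin)) +
  geom_point() +
  geom_smooth(method = "lm")
```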

6. Assumptions of Simple Linear Regression Model

Before moving on to multiple linear regression, let's take a look at the conditions that the data must satisfy for linear regression to be the best method.

- The relationship between the independent variable and the dependent variable should be linear.
- The independent variable should not contain any measurement errors.
- The residuals should be uncorrelated. One cause of correlation among the errors is violation of the linearity assumption.
- The residuals should randomly vary around 0 and their expectation should be equal to 0. Usually this assumption is not problematic as long as a constant is included in the model.
- The variance of the prediction error has to be constant. If not, inferences made from the model can be misleading.
- When doing statistical significance testing, we also have to assume that the errors are normally distributed.

A well-established method to check for violations of these assumptions is a plot of the predicted values against the estimated residuals. This is called a residual plot.
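As a rough sketch (assuming the `simpleLM` object fitted earlier), such a residual plot could be drawn like this:

```r
library(ggplot2)

# Residual plot: fitted values against residuals.
# A horizontal band of points with no visible pattern suggests
# the assumptions above are not badly violated.
ggplot(data.frame(fitted = fitted(simpleLM),
                  residuals = residuals(simpleLM)),
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```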

7. Time to practice!

Now let's try some examples.