
Linear regression

1. Linear regression

We've already finished our core content for this course, but before closing, we will take a quick look at linear regression and logistic models as applications of probability and statistics in data science.

2. Linear functions

Let's start with a linear function. A linear function is a relationship between an independent variable x and a dependent variable y with a constant rate of change, represented by a line.

3. Linear function parameters

The relationship is expressed with two parameters, the slope and the intercept value. When x equals 0, if we apply the line formula we get the intercept value. In our example, the slope is 1.5 and the intercept is 10. Now consider what would happen if we were to add a random number to the value of the function.
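The function from the slide can be sketched in a few lines of Python (slope 1.5, intercept 10, as in the example):

```python
def linear_function(x, slope=1.5, intercept=10):
    """Evaluate y = slope * x + intercept."""
    return slope * x + intercept

# At x = 0 the line formula returns the intercept
print(linear_function(0))   # 10.0
print(linear_function(4))   # 16.0
```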

4. Linear function with random perturbations

Our values are not on the line anymore. Some real data has similar behavior. Now let's go backwards.
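A minimal sketch of adding random perturbations to the line, using numpy (the noise scale of 2 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(0, 10)
y_line = 1.5 * x + 10                              # exact linear function
y_noisy = y_line + rng.normal(0, 2, size=x.size)   # add random perturbations

# The perturbed values no longer sit exactly on the line
print(np.allclose(y_noisy, y_line))  # False
```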

5. Start from the data and find a model that fits

Imagine we have data that shows the relationship between hours of study and students' scores on a test. You can see in the plot that when the hours of study increase, the scores also increase. The idea is to determine if a linear model fits.

6. What model will fit the data?

We might ask ourselves a few questions, like: What model will fit the data? What criteria can we use to determine which is the best model?

7. What model will fit the data? (Cont.)

What are the parameters of such a model? Let's assume that the model is linear and try to answer the other two questions.

8. Residuals of the model

In the plot, the data are the blue dots and the green line is a linear model. The red lines are the distances between the data points and the model's predictions: these distances are the residuals of the linear model.

9. Minimizing residuals

If we square each residual and add them up, we can look for the slope and intercept that minimize that sum. That is the foundation for many models in data science: we look for the model parameters that minimize the distance between the model and the data.
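A sketch of this idea using `np.polyfit`, which performs ordinary least squares for a degree-1 polynomial (the synthetic data here is hypothetical, built around the slope and intercept the narration mentions):

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.linspace(0, 20, 50)
scores = 1.5 * hours + 52.45 + rng.normal(0, 1, hours.size)

# np.polyfit finds the slope and intercept that minimize
# the sum of squared residuals (ordinary least squares)
slope, intercept = np.polyfit(hours, scores, deg=1)
residuals = scores - (slope * hours + intercept)
print(round(slope, 2), round(intercept, 2))
```

With an intercept in the model, the fitted residuals sum to (numerically) zero, which is a quick sanity check on the fit.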

10. Probability and statistics in action

An interesting link between probability and the linear model is that to apply this model to data you must study the distribution of the residuals and their variance. The distribution of the residuals should be normal with constant variance. Otherwise, the linear model is not a good fit. Let's code a bit.
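One common way to check the normality assumption, sketched here with SciPy's Shapiro-Wilk test (an assumption of this example; the course does not prescribe a specific test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
hours = np.linspace(0, 20, 50)
scores = 1.5 * hours + 52.45 + rng.normal(0, 1, hours.size)

slope, intercept = np.polyfit(hours, scores, deg=1)
residuals = scores - (slope * hours + intercept)

# Shapiro-Wilk test: a small p-value would suggest the
# residuals are NOT normally distributed
statistic, p_value = stats.shapiro(residuals)
print(p_value > 0.05)
```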

11. Calculating linear model parameters

To get the parameters from a model we will use the LinearRegression class from sklearn dot linear_model. We use the provided data for hours of study and scores. Then we get the slope and intercept in model dot coef_ and model dot intercept_. In our case the slope is 1.5 and the intercept is 52.45. Now let's predict with our model.
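A runnable version of these steps; the data here is hypothetical, constructed to match the parameters the narration reports (slope 1.5, intercept 52.45), since the course's actual dataset is not shown:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data built around the narration's parameters
hours_of_study = np.array([[2], [5], [8], [11], [14], [17]])
scores = 1.5 * hours_of_study.ravel() + 52.45

model = LinearRegression()
model.fit(hours_of_study, scores)

print(model.coef_[0])      # slope
print(model.intercept_)    # intercept
```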

12. Predicting scores based on hours of study

After fitting the model, we can predict the scores based on hours of study. If we want to predict the score for someone who studies a certain number of hours, we call model dot predict and pass an array with the values we want to evaluate. For 15 hours we get 74.90 as the predicted score. Now let's plot our model.
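A sketch of the prediction step. Because the data below is exactly linear (hypothetical, unlike the course's noisy dataset, which yields 74.90), the prediction for 15 hours lands on the line at 74.95:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical exactly-linear data around the narration's parameters
hours_of_study = np.array([[2], [5], [8], [11], [14], [17]])
scores = 1.5 * hours_of_study.ravel() + 52.45

model = LinearRegression().fit(hours_of_study, scores)

# Predict the score for a student who studies 15 hours
predicted = model.predict(np.array([[15]]))
print(predicted[0])  # 74.95
```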

13. Plotting the linear model

We first import matplotlib dot pyplot as plt. We use plt dot scatter to plot the data in hours_of_study and scores, and we use plt dot plot to plot the provided values and model dot predict to generate the predicted scores. Then we show our plot.

14. Plot the linear model (Cont.)

The result is this plot with a linear relation between hours of study and scores, with minimal error.

15. Let's practice with linear models

Now, let's move on and practice with linear models.
