
Regression analysis

1. Regression analysis

Hello and welcome!

2. Regression analysis

Many survey analysts use regression analysis to understand the relationship between variables and to predict an outcome from them. It lets us gauge the influence of one or more independent variables on a dependent variable. The technique helps identify potential opportunities and risks, reduces large volumes of raw data to actionable information, and provides factual support for making informed decisions.

3. Linear regression using ordinary least squares (OLS) method

Linear regression is the most widely known modeling technique. It establishes the relationship between a dependent variable, y, and at least one independent variable, x. With a single independent variable, it assumes that the relationship takes the form y = m*x + b, that is, that x and y are linearly related. In the statsmodels package in Python, linear regression is performed with the OLS, or Ordinary Least Squares, method. This method calculates the slope, m, and the y-intercept, b, such that the sum of the squared differences between the calculated and observed values of y is minimized.
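To make the minimization concrete, here is a minimal sketch that computes m and b with the closed-form least-squares formulas in NumPy. The data points and the helper name ssr are made up for illustration and are not part of the survey.

```python
import numpy as np

# Hypothetical, made-up observations (not the survey data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for y = m*x + b.
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean


def ssr(slope, intercept):
    """Sum of squared differences between calculated and observed y."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))


print("slope:", m, "intercept:", b)
# Nudging the slope away from the fitted value increases the sum of squares.
print(ssr(m, b), ssr(m + 0.1, b))
```

Any other choice of slope or intercept gives a larger sum of squares on the same points, which is what "least squares" refers to.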

4. Loading data

Let's analyze a survey that records respondents' workout time and calories burned, and estimate how many calories one can expect to burn in a 30-minute workout. First, we load the relevant packages along with our data. The first few survey entries show us the workout time in minutes and the calories burned.
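The loading step might look like the sketch below. The transcript does not name the data file, so a small inline DataFrame with made-up values stands in for it. The variable name survey and every number in it are assumptions, and only the column names come from the lesson.

```python
import pandas as pd

# Stand-in for the survey file, which the transcript does not name.
# In practice this would be loaded from disk, for example with pd.read_csv.
# All values below are made up for illustration.
survey = pd.DataFrame({
    "workout_minutes": [10, 15, 20, 25, 30, 35, 40, 45, 50, 60],
    "calories_burned": [10.3, 15.1, 20.6, 25.0, 30.4, 35.5, 40.1, 45.6, 50.2, 60.7],
})

# Inspect the first few survey entries.
print(survey.head())
```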

5. Define variables

Let's define the independent variable, x, and the dependent variable, y. Here, workout_minutes will be our independent variable, and calories_burned will be our dependent variable. We turn each column into a list using the .tolist() method.
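As a sketch of this step, assuming the survey sits in a pandas DataFrame (here called survey and filled with made-up values):

```python
import pandas as pd

# Made-up stand-in for the survey data; only the column names come from the lesson.
survey = pd.DataFrame({
    "workout_minutes": [10, 15, 20, 25, 30, 35, 40, 45, 50, 60],
    "calories_burned": [10.3, 15.1, 20.6, 25.0, 30.4, 35.5, 40.1, 45.6, 50.2, 60.7],
})

# Independent variable (x) and dependent variable (y) as plain Python lists.
x = survey["workout_minutes"].tolist()
y = survey["calories_burned"].tolist()

print(x[:3], y[:3])
```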

6. Add constant term

Because we want to include a constant y-intercept, we use the add_constant function from statsmodels to add a constant term to our linear equation. Calling add_constant on x takes the x values and returns a new two-dimensional array with a column of ones inserted at the beginning. This tells the model to fit a value for b alongside the slope.

7. Perform regression and fit

Once we fit our regression line to the data, we use the summary method to obtain a table that describes the regression results in detail.

8. Retrieving m and b

For our purposes, we are concerned with the coefficient of x1, 1.0072, which is our slope value, and the constant term, 0.1552, which is our y-intercept value. Based on the survey, we can expect to burn approximately one calorie per minute of exercise.

9. Plot original values

We plot the original values by calling the scatter function from matplotlib.pyplot on our two lists, x and y.
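A minimal sketch of the scatter plot, again on made-up values. The axis labels and the Agg backend line are additions for illustration, and the backend line can be dropped when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up survey-like data (illustrative only).
x = [10, 15, 20, 25, 30, 35, 40, 45, 50, 60]
y = [10.3, 15.1, 20.6, 25.0, 30.4, 35.5, 40.1, 45.6, 50.2, 60.7]

# One point per survey response.
points = plt.scatter(x, y)
plt.xlabel("workout_minutes")
plt.ylabel("calories_burned")
plt.show()
```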

10. Plotting the regression line

To get the range of data values for plotting the regression line, we use .max() and .min() on the minutes column, and then np.arange on these values to create an array of evenly spaced values between the minimum and maximum minutes surveyed. Substituting the m and b values we retrieved earlier into the linear equation, we plot the regression line using the plt.plot function. The 'r' in quotes tells matplotlib to plot a red line. The plt.show function lets us see the graph.
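Put together, the plotting step might look like this sketch. The DataFrame values are made up, m and b are the values reported in the lesson's summary table, and the names x_line and y_line are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the survey data (illustrative only).
survey = pd.DataFrame({
    "workout_minutes": [10, 15, 20, 25, 30, 35, 40, 45, 50, 60],
    "calories_burned": [10.3, 15.1, 20.6, 25.0, 30.4, 35.5, 40.1, 45.6, 50.2, 60.7],
})

# Slope and intercept as reported in the lesson's summary table.
m, b = 1.0072, 0.1552

# Evenly spaced minutes from the minimum up to (but not including) the maximum:
# np.arange excludes its stop value.
x_line = np.arange(survey["workout_minutes"].min(), survey["workout_minutes"].max())
y_line = m * x_line + b

plt.scatter(survey["workout_minutes"], survey["calories_burned"])
plt.plot(x_line, y_line, "r")  # "r" draws the line in red
plt.show()
```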

11. Predict response

Based on these survey results, to predict the number of calories someone burns in 30 minutes, we substitute our calculated slope and y-intercept into the original equation: 1.0072 * 30 + 0.1552 is about 30.4, so approximately 30 calories will be burned in 30 minutes.
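The substitution can be written out directly. The sketch below uses the slope and intercept reported in the lesson, with minutes and predicted as illustrative variable names.

```python
# Slope and intercept as reported in the lesson's summary table.
m, b = 1.0072, 0.1552

# Predict calories burned for a 30-minute workout using y = m*x + b.
minutes = 30
predicted = m * minutes + b

print(round(predicted, 2))  # 30.37
```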

12. Linear regression pros and cons

An advantage of linear regression is that it performs very well when the relationship between the variables is close to linear. Unfortunately, it assumes a linear relationship between the dependent and independent variables even when that isn't the case.

13. Let's practice!

Now let's practice some regression analysis!
