Introduction to regression

1. Introduction to regression

Now we're going to check out the other type of supervised learning: regression. In regression tasks, the target variable typically has continuous values, such as a country's GDP, or the price of a house.

2. Predicting blood glucose levels

To conceptualize regression problems, let's use a dataset containing women's health data to predict blood glucose levels. We load the dataset as a pandas DataFrame, and print the first five rows. It contains features including number of pregnancies, triceps skinfold measurements, insulin levels, body mass index, known as BMI, age in years, and diabetes status, with one indicating a diagnosis, and zero representing the absence of a diagnosis.
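As a rough sketch of this step: the dataset file itself isn't part of this transcript, so the snippet below builds a tiny stand-in DataFrame instead of reading one from disk. The values are invented, and the variable name `diabetes_df` and the column names are assumptions based on the description above.

```python
import pandas as pd

# Stand-in for the course dataset. Values are invented for illustration,
# and the column names are assumed from the narration, not taken from the file.
diabetes_df = pd.DataFrame({
    "pregnancies": [2, 0, 4, 1, 3, 5],
    "glucose": [110, 95, 160, 102, 131, 120],
    "triceps": [30, 25, 33, 21, 36, 28],
    "insulin": [80, 60, 150, 90, 170, 100],
    "bmi": [31.0, 24.5, 35.2, 27.8, 40.1, 29.3],
    "age": [45, 28, 52, 23, 36, 41],
    "diabetes": [1, 0, 1, 0, 1, 0],
})

# With the real data, the DataFrame would come from pd.read_csv(...) instead.
print(diabetes_df.head())  # head() returns the first five rows by default
```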

3. Creating feature and target arrays

Recall that scikit-learn requires features and target values in distinct variables, X and y. To use all of the features in our dataset, we drop our target, blood glucose levels, and store the values attribute as X. For y, we take the target column's values attribute. We can print the type for X and y to confirm they are now both NumPy arrays.
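Here is one way that split might look. The small DataFrame is a made-up stand-in for the course data, and the column name `glucose` is an assumption.

```python
import pandas as pd

# Made-up stand-in data; "glucose" is the assumed name of the target column.
diabetes_df = pd.DataFrame({
    "pregnancies": [2, 0, 4, 1],
    "glucose": [110, 95, 160, 102],
    "triceps": [30, 25, 33, 21],
    "insulin": [80, 60, 150, 90],
    "bmi": [31.0, 24.5, 35.2, 27.8],
    "age": [45, 28, 52, 23],
    "diabetes": [1, 0, 1, 0],
})

# Features: every column except the target, as a NumPy array.
X = diabetes_df.drop("glucose", axis=1).values

# Target: the glucose column, as a NumPy array.
y = diabetes_df["glucose"].values

print(type(X), type(y))
```

Note that `drop` returns a new DataFrame, so `diabetes_df` itself still contains the target column afterwards.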

4. Making predictions from a single feature

To start, let's try to predict blood glucose levels from a single feature: body mass index. To do this, we slice out the BMI column of X, which is the fourth column (index three), storing it as the variable X_bmi. Checking the shape of y and X_bmi, we see that they are both one-dimensional arrays. This is fine for y, but our features must be formatted as a two-dimensional array to be accepted by scikit-learn. To convert the shape of X_bmi we apply NumPy's dot-reshape method, passing minus one followed by one. Printing the shape again shows X_bmi is now the correct shape for our model.
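The slicing and reshaping can be sketched as follows. The arrays are small invented stand-ins, and the column order (pregnancies, triceps, insulin, BMI, age, diabetes) is assumed from the earlier description.

```python
import numpy as np

# Invented stand-in arrays: 5 observations, 6 features.
# Assumed column order: pregnancies, triceps, insulin, bmi, age, diabetes.
X = np.array([
    [2, 30, 80, 31.0, 45, 1],
    [0, 25, 60, 24.5, 28, 0],
    [4, 33, 150, 35.2, 52, 1],
    [1, 21, 90, 27.8, 23, 0],
    [3, 36, 170, 40.1, 36, 1],
])
y = np.array([110, 95, 160, 102, 131])

# The fourth column sits at index 3 because indexing starts at zero.
X_bmi = X[:, 3]
print(y.shape, X_bmi.shape)  # both are one-dimensional here

# -1 asks NumPy to infer the number of rows; 1 gives a single column.
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)  # now two-dimensional: one row per observation
```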

5. Plotting glucose vs. body mass index

Now, let's plot blood glucose levels as a function of body mass index. We import matplotlib-dot-pyplot as plt, then pass X_bmi and y to plt-dot-scatter. We'll also label our axes using the xlabel and ylabel functions.
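A sketch of the plotting code, using randomly generated stand-in data rather than the course dataset. The axis label text is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (not the course dataset): BMI values with a
# noisy, roughly increasing glucose response.
rng = np.random.default_rng(0)
X_bmi = rng.uniform(18, 45, size=(50, 1))
y = 80 + 1.5 * X_bmi[:, 0] + rng.normal(0, 20, size=50)

# Scatter plot of the target against the single feature.
plt.scatter(X_bmi, y)
plt.xlabel("Body Mass Index")
plt.ylabel("Blood Glucose")
# In an interactive session, call plt.show() to display the figure.
```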

6. Plotting glucose vs. body mass index

We can see that, generally, as body mass index increases, blood glucose levels also tend to increase.

7. Fitting a regression model

It's time to fit a regression model to our data. We're going to use a model called linear regression, which fits a straight line to our data. We will explain the mechanics of linear regression in the next video, but first, let's see how to fit it and plot predictions. We import LinearRegression from sklearn-dot-linear_model, and instantiate our regression model. As we are modeling the relationship between the feature, body mass index, and the target, blood glucose levels, rather than predicting target values for new observations, we fit the model to all of our feature observations. We do this by calling reg-dot-fit and passing in the feature data and the target variable, the same as we did for classification problems. After this, we can create the predictions variable by calling reg-dot-predict and passing in our features. As we are predicting the target values of the features used to train the model, this gives us a line of best fit for our data. We produce our scatter plot again, and then call plt-dot-plot to produce a line plot, passing our features, X_bmi, followed by our predictions.
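The fitting and plotting steps described here might look like the sketch below. The data is synthetic stand-in data, and the names `reg` and `predictions` follow the narration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the course dataset).
rng = np.random.default_rng(0)
X_bmi = rng.uniform(18, 45, size=(50, 1))
y = 80 + 1.5 * X_bmi[:, 0] + rng.normal(0, 20, size=50)

# Instantiate the model and fit it to all observations.
reg = LinearRegression()
reg.fit(X_bmi, y)

# Predicting on the training features gives the points of the fitted line.
predictions = reg.predict(X_bmi)

# Scatter plot of the data with the fitted line drawn on top.
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions, color="black")
plt.xlabel("Body Mass Index")
plt.ylabel("Blood Glucose")
# In an interactive session, call plt.show() to display the figure.
```

Because the model has one feature, every prediction lies on the line `reg.intercept_ + reg.coef_[0] * bmi`, which is why a plain line plot of the predictions traces the fit.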

8. Fitting a regression model

The black line represents the linear regression model's fit of blood glucose values against body mass index. The two variables appear to have a weak-to-moderate positive correlation.

9. Let's practice!

Now let's build a regression model of our own!