
Linear regression with tidymodels

1. Linear regression with tidymodels

In this section, we will fit our first machine learning model, linear regression!

2. Model fitting with parsnip

Within the tidymodels ecosystem, the parsnip package is used for fitting models and calculating predictions.

3. Linear regression model

Linear regression estimates the outcome variable as a linear function of the predictor variable. If we are predicting hwy using cty as a predictor from the mpg dataset, then the functional form of the model is written as hwy = β0 + β1 × cty. β0 and β1 are known as model parameters, and represent the intercept and slope of the line, respectively.

4. Linear regression model

The model parameters are estimated using the training data. Using the mpg_training data, the intercept and slope were estimated to be 0.77 and 1.35, respectively. The blue line in the plot graphs our estimated regression line over the mpg training data values.
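Plugging the estimated parameters into the model equation gives a quick worked example (a sketch; the value of cty is chosen arbitrarily for illustration):

```r
# Fitted line: hwy = 0.77 + 1.35 * cty
# Predicted hwy for a car with cty = 20
0.77 + 1.35 * 20
# = 27.77
```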

5. Model formulas

Before parsnip can fit a model to data, it requires columns to be assigned to either an outcome or predictor role. This is done with R formulas and follows the general form of outcome variable on the left followed by a tilde and then by one or more predictor variables separated by plus signs. To use all available columns in a data frame as predictors, the shorthand notation outcome ~ . can be used. To predict hwy using cty, we would use hwy ~ cty as our model formula.
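The formula patterns described above look like this in R (displ is included only to illustrate multiple predictors; it is not used elsewhere in this lesson):

```r
# Outcome on the left, predictor on the right of the tilde
hwy ~ cty

# Multiple predictors are separated by plus signs
hwy ~ cty + displ

# Shorthand: use all remaining columns as predictors
hwy ~ .
```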

6. The parsnip package

The parsnip package provides a unified syntax for model specification in R. Building a parsnip model object involves specifying the model type, such as linear regression, the computational engine, which specifies the underlying package that will be used to fit the model, and the mode, which is either regression or classification. With parsnip, it is possible to fit a linear regression with the traditional 'lm' engine provided in base R or the 'stan' engine, which estimates the model parameters using Bayesian methods. The power of parsnip is that it unifies a large number of machine learning packages that fit the same model type under a common syntax.
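As a sketch, here are two parsnip specifications for the same model type that differ only in engine (the 'stan' engine assumes the rstanarm package is installed):

```r
library(parsnip)

# Traditional least-squares fit via base R's lm()
linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Same model type, Bayesian parameter estimation via rstanarm
linear_reg() %>%
  set_engine("stan") %>%
  set_mode("regression")
```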

7. Fitting a linear regression model

To fit our linear regression model, we start by defining a parsnip model object named lm_model. We use the linear_reg() function to create a linear regression model object. Then we pass it into the set_engine() function, where we specify the 'lm' engine. Finally, we pass the result into the set_mode() function, where we specify 'regression' since we are predicting a numeric outcome variable. To train our model, we pass lm_model to the fit() function and provide a model formula and the data on which to train the model.
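Putting the steps above together (a sketch; mpg_training is assumed to be the training split of the mpg data):

```r
library(tidymodels)

# Specify the model type, engine, and mode
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Train the model with a formula and the training data
lm_fit <- lm_model %>%
  fit(hwy ~ cty, data = mpg_training)
```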

8. Obtaining the estimated parameters

Once the model is trained it can be passed into the tidy() function to create a model summary tibble, which is a specialized data frame used in tidymodels. The term and estimate columns provide the estimated model parameters.
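Passing a trained model named lm_fit to tidy() would look like the following (the output shape is a sketch; the estimates shown are the ones from the earlier slide, and the remaining columns are elided):

```r
tidy(lm_fit)
# A tibble with one row per model term, e.g.:
# term        estimate
# (Intercept) 0.77
# cty         1.35
```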

9. Making predictions

Model predictions are obtained by passing the trained model, lm_fit, to the predict() function. The new_data argument specifies the dataset on which to predict new values and is typically set to the testing dataset. The predict() function returns standardized output: it is always a tibble, it has the same row order as the data in the new_data argument, and it contains a column named .pred with the model predictions.
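A sketch of the prediction step (mpg_test is assumed to be the held-out test split):

```r
# Predict on the test data; returns a tibble with a .pred column
hwy_predictions <- lm_fit %>%
  predict(new_data = mpg_test)
```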

10. Adding predictions to the test data

To evaluate model performance, we will need to add the model predictions to the test dataset. The bind_cols() function can be used to combine multiple data frames along the column axis. First, we select the hwy and cty columns in mpg_test and pass the result to bind_cols(), where we add the hwy_predictions tibble.
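The column-binding step described above can be sketched as (mpg_test_results is an assumed name for the combined result):

```r
library(dplyr)

# Combine the observed values and the predictions column-wise
mpg_test_results <- mpg_test %>%
  select(hwy, cty) %>%
  bind_cols(hwy_predictions)
```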

11. Let's model!

Let's practice fitting linear regression models!
