1
What is Regression?
Free
In this chapter we introduce the concept of regression from a machine learning point of view. We will present the fundamental regression method: linear regression. We will show how to fit a linear regression model and to make predictions from the model.
2
Training and Evaluating Regression Models
Now that we have learned how to fit basic linear regression models, we will learn how to evaluate how well our models perform. We will review evaluating a model graphically, and look at two basic metrics for regression models. We will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.
3
Issues to Consider
Before moving on to more sophisticated regression techniques, we will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it's important to be aware of them, in order to understand which methods best handle various issues -- and which issues you must still manage yourself.
4
Dealing with Non-Linear Responses
Now that we have mastered linear models, we will begin to look at techniques for modeling situations that don't meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a non-linear but additive relationship to the inputs. These algorithms are variations on the standard linear model.
5
Tree-Based Methods
In this chapter we will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are *tree-based* methods that work by combining ensembles of *decision trees* that are learned from the training data.

Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a dataset dframe with \(N\) rows, and you want a random subset containing approximately \(100 \cdot X\)% of those rows (where \(X\) is between 0 and 1), then:

Generate a vector of uniform random numbers: gp = runif(N).
dframe[gp < X,] will be about the right size.
dframe[gp >= X,] will be the complement.
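
The recipe above can be sketched on a small made-up data frame. The data frame contents, the seed, and the names subset_a and subset_b are illustrative assumptions, not part of the original exercise. Because each row is kept independently with probability X, the subset size is only approximately X * N.

```r
# Hypothetical stand-in for dframe (assumption: not from the exercise)
set.seed(42)                       # assumption: seed fixed so the run is repeatable
dframe <- data.frame(id = 1:1000, value = rnorm(1000))

N <- nrow(dframe)                  # number of rows
X <- 0.75                          # desired fraction, between 0 and 1

gp <- runif(N)                     # one uniform random number per row

subset_a <- dframe[gp < X, ]       # roughly X * N rows
subset_b <- dframe[gp >= X, ]      # the complement: roughly (1 - X) * N rows

# The two pieces partition the data, so their sizes add up to N
nrow(subset_a)
nrow(subset_b)
```

Since gp < X and gp >= X are mutually exclusive and cover every row, no row can land in both subsets or in neither.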

Use the function nrow() to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.
Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.
Use runif() to generate a vector of N uniform random numbers, called gp.
Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).
Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
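
One possible way to work through these steps is sketched below. It assumes the ggplot2 package is installed. The call to set.seed() and the use of round() for target are choices made here for illustration and are not required by the exercise.

```r
library(ggplot2)                   # provides the mpg data frame

set.seed(1234)                     # assumption: seed added only for repeatable results

# Number of rows in mpg
N <- nrow(mpg)
print(N)

# About how many rows 75% of N should be
target <- round(0.75 * N)
print(target)

# One uniform random number per row
gp <- runif(N)

# Rows with gp < 0.75 go to training, the rest to test
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

# Check the sizes: they should be close to target and N - target
nrow(mpg_train)
nrow(mpg_test)
```

The sizes will usually not match target exactly, because the split is random row by row rather than a fixed-size sample.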