1. A tale of two variables
Hi, I'm Richie. Welcome to the course. You'll be learning about relationships between pairs of variables. Let's start with a real example.
2. Swedish motor insurance data
This dataset on Swedish motor insurance claims is as simple as it gets. Each row represents a region in Sweden, and the two variables are the number of claims made in that region, and the total payment made by the insurance company for those claims, in Swedish krona.
3. Descriptive statistics
This course assumes you have experience of calculating descriptive statistics on variables in a data frame. For example, calculating the mean of each variable. There are many ways to do this in R; the code shown uses the dplyr package.
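As a quick refresher, a dplyr mean calculation looks like this. This is a minimal sketch: the data frame name `swedish_motor_insurance` and the column names `n_claims` and `total_payment_sek` are illustrative assumptions, since the transcript doesn't show them.

```r
library(dplyr)

# Mean of each variable; data frame and column names are assumed.
swedish_motor_insurance %>%
  summarize(
    mean_n_claims = mean(n_claims),
    mean_payment = mean(total_payment_sek)
  )
```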
The course also assumes you understand the correlation between two variables. Here, the correlation is zero-point-eight-eight, a strong positive correlation. That means that as the number of claims increases, the total payment typically increases as well.
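The correlation can be computed with `cor()`, here inside a dplyr `summarize()`. Again, the data frame and column names are assumptions for illustration.

```r
library(dplyr)

# Pearson correlation between the two variables; names are assumed.
swedish_motor_insurance %>%
  summarize(correlation = cor(n_claims, total_payment_sek))

# Base R equivalent:
# cor(swedish_motor_insurance$n_claims, swedish_motor_insurance$total_payment_sek)
```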
4. What is regression?
Regression models are a class of statistical models that let you explore the relationship between a response variable and some explanatory variables.
That is, given some explanatory variables, you can make predictions about the value of the response variable. In the insurance dataset, if you know the number of claims made in a region, you can predict the amount that the insurance company has to pay out.
That lets you do thought experiments like asking how much the company would need to pay if the number of claims increased to two hundred.
5. Jargon
The response variable, the one whose values you want to predict, is also known as the dependent variable. These two terms are completely interchangeable.
Explanatory variables, used to explain how the predictions will change, are also known as independent variables. Again, these terms are interchangeable.
6. Linear regression and logistic regression
In this course we're going to look at two types of regression. Linear regression is used when the response variable is numeric, like in the motor insurance dataset.
Logistic regression is used when the response variable is logical. That is, it takes TRUE or FALSE values.
We'll limit the scope further to only consider simple linear regression and simple logistic regression. This means you only have a single explanatory variable.
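In R, a simple linear regression is fit with `lm()`, using a formula with a single explanatory variable on the right-hand side. This is a sketch under the same assumed names as before; `mdl` is just an illustrative variable name.

```r
# Fit a simple linear regression: one response, one explanatory variable.
# Data frame and column names are assumed for illustration.
mdl <- lm(total_payment_sek ~ n_claims, data = swedish_motor_insurance)
mdl

# The thought experiment from earlier: predicted total payment
# if the number of claims were 200.
predict(mdl, data.frame(n_claims = 200))
```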
7. Visualizing pairs of variables
Before you start running regression models, it's a good idea to visualize your dataset. To visualize the relationship between two numeric variables, you can use a scatter plot.
The course assumes your data visualization skills are strong enough that you can understand the ggplot code written here. If not, try taking one of DataCamp's courses on ggplot before you begin this course.
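The scatter plot described here can be drawn with ggplot2 along these lines, assuming the same illustrative data frame and column names as before.

```r
library(ggplot2)

# Scatter plot of total payment against number of claims;
# data frame and column names are assumed.
ggplot(swedish_motor_insurance, aes(n_claims, total_payment_sek)) +
  geom_point()
```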
In the plot, you can see that the total payment increases as the number of claims increases. It would be nice to be able to describe this increase more precisely.
8. Adding a linear trend line
One refinement we can make is to add a trend line to the scatter plot. A trend line means fitting a line that follows the data points. In ggplot, trend lines are added using geom_smooth().
Setting the method argument to "lm", short for "linear model", gives a trend line calculated with a linear regression. This means the trend line is a straight line that follows the data as closely as possible.
By default, geom_smooth() also shows a standard error ribbon, which I've turned off by setting se to FALSE.
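Putting those pieces together, the plot with a linear trend line looks like this sketch, again using the assumed data frame and column names.

```r
library(ggplot2)

# Scatter plot with a linear regression trend line.
# method = "lm" fits a straight line; se = FALSE hides the
# standard error ribbon that geom_smooth() shows by default.
ggplot(swedish_motor_insurance, aes(n_claims, total_payment_sek)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```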
The trend line is mostly quite close to the data points, so we can say that the linear regression is a reasonable fit.
9. Course flow
Here's the plan for the course. First, we'll visualize and fit linear regressions. Then we'll make predictions with them. Thirdly, we'll look at ways of quantifying whether or not the model is a good fit. In the final chapter, we'll run through this flow again using logistic regression models.
10. Let's practice!
Let's get started.