1. A tale of two variables
Hi, I'm Richie. Welcome to the course. You'll be learning about relationships between pairs of variables. Let's start with a real example.
2. Swedish motor insurance data
This dataset on Swedish motor insurance claims is as simple as it gets. Each row represents a region in Sweden, and the two variables are the number of claims made in that region, and the total payment made by the insurance company for those claims, in Swedish krona.
3. Descriptive statistics
This course assumes you have experience of calculating descriptive statistics on variables in a data frame. For example, calculating the mean of each variable. There are many ways to do this in R; the code shown uses the dplyr package.
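As a quick refresher, a dplyr mean calculation looks like this. This is a minimal sketch: the data frame name `swedish_motor_insurance` and the column names `n_claims` and `total_payment_sek` are illustrative assumptions, since the transcript doesn't show them.

```r
library(dplyr)

# Mean of each variable; data frame and column names are assumed.
swedish_motor_insurance %>%
  summarize(
    mean_n_claims = mean(n_claims),
    mean_payment = mean(total_payment_sek)
  )
```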
The course also assumes you understand the correlation between two variables. Here, the correlation is zero-point-eight-eight, a strong positive correlation. That means that as the number of claims increases, the total payment typically increases as well.
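The correlation can be computed with `cor()`, here inside a dplyr `summarize()`. Again, the data frame and column names are assumptions for illustration.

```r
library(dplyr)

# Pearson correlation between the two variables; names are assumed.
swedish_motor_insurance %>%
  summarize(correlation = cor(n_claims, total_payment_sek))

# Base R equivalent:
# cor(swedish_motor_insurance$n_claims, swedish_motor_insurance$total_payment_sek)
```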
4. What is regression?
Regression models are a class of statistical models that let you explore the relationship between a response variable and some explanatory variables.
That is, given some explanatory variables, you can make predictions about the value of the response variable. In the insurance dataset, if you know the number of claims made in a region, you can predict the amount that the insurance company has to pay out.
That lets you do thought experiments like asking how much the company would need to pay if the number of claims increased to two hundred.
5. Jargon
The response variable, the one whose values you want to predict, is also known as the dependent variable. These two terms are completely interchangeable.
Explanatory variables, used to explain how the predictions will change, are also known as independent variables. Again, these terms are interchangeable.
6. Linear regression and logistic regression
In this course we're going to look at two types of regression. Linear regression is used when the response variable is numeric, like in the motor insurance dataset.
Logistic regression is used when the response variable is logical. That is, it takes TRUE or FALSE values.
We'll limit the scope further to only consider simple linear regression and simple logistic regression. This means you only have a single explanatory variable.
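In R, a simple linear regression is fit with `lm()`, using a formula with a single explanatory variable on the right-hand side. This is a sketch under the same assumed names as before; `mdl` is just an illustrative variable name.

```r
# Fit a simple linear regression: one response, one explanatory variable.
# Data frame and column names are assumed for illustration.
mdl <- lm(total_payment_sek ~ n_claims, data = swedish_motor_insurance)
mdl

# The thought experiment from earlier: predicted total payment
# if the number of claims were 200.
predict(mdl, data.frame(n_claims = 200))
```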
7. Visualizing pairs of variables
Before you start running regression models, it's a good idea to visualize your dataset. To visualize the relationship between two numeric variables, you can use a scatter plot.
The course assumes your data visualization skills are strong enough that you can understand the ggplot code written here. If not, try taking one of DataCamp's courses on ggplot before you begin this course.
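The scatter plot described here can be drawn with ggplot2 along these lines, assuming the same illustrative data frame and column names as before.

```r
library(ggplot2)

# Scatter plot of total payment against number of claims;
# data frame and column names are assumed.
ggplot(swedish_motor_insurance, aes(n_claims, total_payment_sek)) +
  geom_point()
```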
In the plot, you can see that the total payment increases as the number of claims increases. It would be nice to be able to describe this increase more precisely.
8. Adding a linear trend line
One refinement we can make is to add a trend line to the scatter plot. A trend line means fitting a line that follows the data points. In ggplot, trend lines are added using geom_smooth().
Setting the method argument to "lm", short for "linear model", gives a trend line calculated with a linear regression. This means the trend line is a straight line that follows the data as closely as possible.
By default, geom_smooth() also shows a standard error ribbon, which I've turned off by setting se to FALSE.
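Putting those pieces together, the plot with a linear trend line looks like this sketch, again using the assumed data frame and column names.

```r
library(ggplot2)

# Scatter plot with a linear regression trend line.
# method = "lm" fits a straight line; se = FALSE hides the
# standard error ribbon that geom_smooth() shows by default.
ggplot(swedish_motor_insurance, aes(n_claims, total_payment_sek)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```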
The trend line is mostly quite close to the data points, so we can say that the linear regression is a reasonable fit.
9. Course flow
Here's the plan for the course. First, we'll visualize and fit linear regressions. Then we'll make predictions with them. Thirdly, we'll look at ways of quantifying whether or not the model is a good fit. In the final chapter, we'll run through this flow again using logistic regression models.
10. Let's practice!
Let's get started.