The modeling problem for explanation

1. The modeling problem for explanation

Now that you have some background on modeling, let's introduce the goal of modeling for explanation and use the evals teaching score data as an example.

2. Recall: General modeling framework formula

Recall that, of the elements of the general modeling framework, we previously covered the outcome variable y and the explanatory/predictor variables x. Now let's first study f, which defines an explicit relationship between y and x, and then the error component epsilon.

3. The modeling problem

Let's address some points about the modeling problem.
- Usually you won't know the true form of f or the mechanism that generates the errors epsilon.
- However, you will know the observations y and x, as they are given in your data.
- Using y and x, the goal is to construct or "fit" a model f-hat that approximates the true f, but not epsilon.
- In other words, you want to separate the signal from the noise.
- With the fitted model f-hat, you can apply it to x to obtain fitted or predicted values of y called y-hat.

In this course, you'll keep things simple and only fit models that are linear. But first, let's perform an EDA of the relationship between the variables in our modeling for explanation example.
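The points about the modeling problem can be sketched with simulated data, where, unlike in practice, the true f and the noise are known. This is a minimal illustration in base R, not the course's own code: the function f, the noise level, and the seed are all made-up assumptions chosen for the example.

```r
# Simulate data from a known f and known noise (both hypothetical).
set.seed(76)
n <- 100
x <- runif(n, min = 0, max = 10)
f <- function(x) 2 + 0.5 * x           # the "true" f, normally unknown
epsilon <- rnorm(n, mean = 0, sd = 1)  # the noise, normally unknown
y <- f(x) + epsilon                    # what you actually observe

# Fit a linear model f-hat using only the observed x and y.
f_hat <- lm(y ~ x)
print(coef(f_hat))  # estimates of the intercept and slope of f

# Apply f-hat to x to obtain the fitted values y-hat.
y_hat <- fitted(f_hat)

# What f-hat leaves unexplained is the residual, an estimate of the noise.
residual <- y - y_hat
```

Because the data were simulated, you can compare the estimated coefficients against the intercept and slope used in f. With real data such as evals, no such check is available, which is why f-hat is only ever an approximation.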

4. Modeling for explanation example

Earlier you performed a univariate EDA on the outcome variable score and the explanatory variable age. By univariate we mean you only considered one variable at a time. The goal of modeling, however, is to explore relationships between variables. So how can you visually explore such relationships? Using a scatterplot!

5. EDA of relationship

You use geom_point() to create a scatterplot with x mapped to age and y mapped to score. This marks each instructor's age and score with a point.
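A sketch of what that scatterplot code might look like with ggplot2. The data frame evals_demo is a small made-up stand-in so the example runs on its own; assuming the course's evals data frame is loaded, you would pass evals instead.

```r
library(ggplot2)

# Hypothetical stand-in for the evals data (values are made up).
evals_demo <- data.frame(
  age   = c(36, 36, 40, 45, 52, 58, 62, 70, 70, 73),
  score = c(4.7, 4.1, 4.5, 3.9, 4.3, 3.6, 4.0, 4.6, 4.6, 3.5)
)

# Map age to the x-axis and score to the y-axis, one point per row.
p <- ggplot(evals_demo, aes(x = age, y = score)) +
  geom_point()
print(p)
```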

6. EDA of relationship

Let's ask ourselves: is the relationship positive, meaning that as professors age they also get higher scores? Or is it negative? It’s hard to say, as the pattern isn’t very clear. Before you attempt to answer this, let's first address an issue known as overplotting. For example, focus on the point at age = 70 with the highest score of 4.6. Although not immediately apparent, there are actually not one but two perfectly superimposed points. How can you visually bring this fact to light? By adding a little random jitter to each point, that is, by nudging each point just enough that you can distinguish them, but not so much that the plot is overly altered.
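Besides jittering, one way to confirm overplotting is to count how many rows share the same pair of values. The sketch below uses dplyr on a made-up stand-in data frame, evals_demo, which deliberately contains two identical rows; it is an illustration, not part of the course code.

```r
library(dplyr)

# Hypothetical stand-in for the evals data (values are made up).
evals_demo <- data.frame(
  age   = c(36, 36, 40, 45, 52, 58, 62, 70, 70, 73),
  score = c(4.7, 4.1, 4.5, 3.9, 4.3, 3.6, 4.0, 4.6, 4.6, 3.5)
)

# Count rows per (age, score) pair and keep pairs that occur more than once.
overplotted <- evals_demo %>%
  count(age, score) %>%
  filter(n > 1)
print(overplotted)
```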

7. Jittered scatterplot

This is done by taking the same code that generated the scatterplot and replacing the geom_point() with a geom_jitter().
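A sketch of that swap, again on a made-up stand-in data frame rather than the course's evals data. The width and height arguments of geom_jitter() control how far points may be nudged in each direction; the values used here are illustrative assumptions, and the seed only makes the random nudges reproducible.

```r
library(ggplot2)

# Hypothetical stand-in for the evals data (values are made up).
evals_demo <- data.frame(
  age   = c(36, 36, 40, 45, 52, 58, 62, 70, 70, 73),
  score = c(4.7, 4.1, 4.5, 3.9, 4.3, 3.6, 4.0, 4.6, 4.6, 3.5)
)

set.seed(76)  # jitter is random, so fix the seed for a repeatable plot

# Same mapping as the scatterplot; only the geom changes.
p_jitter <- ggplot(evals_demo, aes(x = age, y = score)) +
  geom_jitter(width = 0.5, height = 0.05)
print(p_jitter)
```

Leaving out width and height falls back to ggplot2's default amount of jitter. Larger values separate the points more but distort the picture more.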

8. Jittered scatterplot

Observe there are indeed two values at age 70 and score 4.6. Other overplotted points similarly get broken up. Note that the jittering is strictly a visualization tool; it does not alter the original values in the dataset. So back to our earlier question: is the relationship positive or negative? You can answer this using the correlation coefficient.

9. Correlation coefficient

A correlation coefficient is a summary statistic between -1 and 1 that measures the strength of linear association between two numerical variables, or the degree to which points fall on a line. In the top left plot where the correlation is -1, the points fall perfectly on a negatively sloped line. So as values of x increase, values of y decrease in lock-step. In the bottom right plot where the correlation is +1, the relationship is perfectly positive. In the middle where the correlation is 0, there is no linear relationship between x and y. The remaining plots illustrate other in-between values. Let's compute the correlation coefficient for age and score!
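The extreme cases can be checked numerically in base R. The vectors below are made up for illustration. The last part computes the sample correlation by hand from its usual definition, the sum of products of deviations divided by (n - 1) times the two standard deviations, and compares it with cor().

```r
x <- -5:5

cor(x, -2 * x + 1)  # points on a negatively sloped line: -1 up to rounding
cor(x, 3 * x + 4)   # points on a positively sloped line: +1 up to rounding
cor(x, x^2)         # y depends on x, yet there is no linear association

# Hand computation on noisy data (seed and noise level are arbitrary).
set.seed(76)
y <- 3 + 0.5 * x + rnorm(length(x))
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  ((length(x) - 1) * sd(x) * sd(y))
print(c(manual = r_manual, builtin = cor(x, y)))
```

The x^2 case is a reminder that a correlation near 0 rules out a linear pattern only; a strong curved relationship can still be present.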

10. Computing the correlation coefficient

The cor() function takes two numerical variables and returns their correlation; here you call it inside summarize(). The result, -0.107, indicates a negative relationship, meaning that as professors age, they tend to get lower scores. However, this relationship is only weakly negative.
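A sketch of that computation with dplyr. It runs on a made-up stand-in data frame, so the number it prints will differ from the -0.107 reported for the course's evals data; assuming evals is loaded, you would pipe evals into summarize() instead.

```r
library(dplyr)

# Hypothetical stand-in for the evals data (values are made up).
evals_demo <- data.frame(
  age   = c(36, 36, 40, 45, 52, 58, 62, 70, 70, 73),
  score = c(4.7, 4.1, 4.5, 3.9, 4.3, 3.6, 4.0, 4.6, 4.6, 3.5)
)

# summarize() collapses the data frame to one row holding the correlation.
result <- evals_demo %>%
  summarize(correlation = cor(score, age))
print(result)
```

The order of the two arguments does not matter, since cor(score, age) and cor(age, score) return the same value.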

11. Let's practice!

In the next exercise, you'll perform an EDA of the relationship between teaching score and beauty score.