1. Assessing a trend line
So far, you've only been looking at scatter plots and trend lines to assess whether a trend line is a good fit. While that is an important first visual clue, a second step is to quantify the relationship between two variables more precisely.
2. Linear and logarithmic models
If we look at the linear and logarithmic trend lines from the previous lesson, we see that the latter is a better fit. To understand how a trend line is calculated, and why one trend line fits better than another, we must look under Tableau's hood.
3. Linear model
For a linear trend line, Tableau uses a so-called linear regression model of the form y equals a times x plus b. This is the most basic form of regression: quantifying how y changes with x, with a being the slope or steepness, and b the intercept, the point where the trend line crosses the y-axis. a and b are called the model coefficients.
If we want to apply the model to our particular dataset, we replace y and x by richness and distance, respectively. Tableau then calculates a and b so that the line follows the data points as closely as possible.
In this case, the slope is 0 point 0038. This means that, for every additional meter of distance, species richness is expected to increase by 0 point 0038.
The intercept is 13 point 4, which we can also see on the plot where the trend line crosses the y-axis. The formula allows you to predict species richness from a measured distance.
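To make this concrete, here is a minimal sketch in Python of fitting such a linear model outside of Tableau. The distance and richness values below are invented for illustration, not the course dataset, though they are shaped so the fitted coefficients land close to the slide's slope and intercept; numpy is assumed to be available.

```python
import numpy as np

# Hypothetical data, NOT the course dataset: distance in meters and
# species richness, shaped to roughly match the lesson's trend line.
distance = np.array([100, 400, 800, 1500, 2500, 4000], dtype=float)
richness = np.array([14, 15, 16, 19, 23, 29], dtype=float)

# Least-squares fit of y = a*x + b; polyfit returns [a, b]
a, b = np.polyfit(distance, richness, 1)

def predict(x):
    """Predicted species richness at a given distance."""
    return a * x + b

print(f"slope a = {a:.4f}, intercept b = {b:.1f}")
print(f"predicted richness at 1000 m: {predict(1000):.1f}")
```

With these made-up points, the fit comes out near a slope of 0.0039 and an intercept of about 13.3, in the same ballpark as the lesson's 0.0038 and 13.4.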
4. Residuals and $R^2$ of linear model
For every trend line, Tableau uses a different model. Discussing them all is beyond the scope of this course, but they all have the same goal: making the distances between the observations and the trend line as small as possible. The vertical distance between an observation and the trend line is called a residual. To illustrate, I've colored four residuals on this plot.
The residuals are needed to calculate the R-squared, also known as the coefficient of determination. For a linear model, the R-squared is simply the correlation coefficient squared.
R-squared ranges between zero and one, where one means you have a perfect fit, and zero means your model explains none of the variation, doing no better than always predicting the average.
In this case, the R-squared is 0 point 33. In more practical terms, that means that 33 percent of the variation in species richness is explained by the distance. That's a rather poor fit.
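As an illustration of how residuals and R-squared relate, here is a small Python sketch, again with made-up numbers and numpy assumed available, that computes R-squared from the residuals and checks that, for a linear model, it equals the correlation coefficient squared:

```python
import numpy as np

# Made-up observations with some scatter around a linear trend
distance = np.array([100, 400, 800, 1500, 2500, 4000], dtype=float)
richness = np.array([14, 12, 18, 16, 24, 21], dtype=float)

a, b = np.polyfit(distance, richness, 1)
predicted = a * distance + b

# A residual is the vertical distance between an observation and the line
residuals = richness - predicted

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((richness - richness.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For a linear model this equals the correlation coefficient squared
r = np.corrcoef(distance, richness)[0, 1]
print(f"R-squared = {r_squared:.3f}, r^2 = {r**2:.3f}")
```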
5. $R^2$ of logarithmic model
Let's compare with the R-squared from the logarithmic model. This time, R-squared has a value of 0 point 59, a much higher value. That means that the logarithmic model explains more of the variation than the linear model, confirming our initial impression that it is the better fit.
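The comparison between the two models can be sketched in Python as follows. The data points here are invented to lie near a logarithmic curve (roughly y = 2.5 ln x + 1), so the logarithmic fit should come out on top, mirroring the lesson's result; numpy is assumed available.

```python
import numpy as np

# Invented data lying close to y = 2.5*ln(x) + 1, i.e. log-shaped
distance = np.array([100, 400, 800, 1500, 2500, 4000], dtype=float)
richness = np.array([12.51, 15.98, 17.71, 19.28, 20.56, 21.74])

def r_squared(observed, predicted):
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear model: y = a*x + b
a1, b1 = np.polyfit(distance, richness, 1)
r2_linear = r_squared(richness, a1 * distance + b1)

# Logarithmic model: y = a*ln(x) + b, i.e. linear in ln(x)
a2, b2 = np.polyfit(np.log(distance), richness, 1)
r2_log = r_squared(richness, a2 * np.log(distance) + b2)

print(f"linear R-squared: {r2_linear:.2f}")
print(f"logarithmic R-squared: {r2_log:.2f}")
```

Because these points follow a logarithmic shape, the logarithmic model captures nearly all of the variation, while the linear model leaves a clear pattern in its residuals.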
6. Residual standard error (RSE)
Another useful metric is the residual standard error, sometimes abbreviated as RSE. It gives an idea of the average difference between the observed values and the trend line, and it has the same unit as the variable on the y-axis, in this case number of species. For the linear model, the RSE is 3 point 69, which means that the model typically differs by 3 to 4 species from the observed values.
The RSE for the logarithmic model is 2 point 91, which means that this model is slightly more accurate in predicting the correct number of species. Because it is expressed in the same unit as the variable you are predicting, the RSE makes it easy to compare the accuracy of different models.
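A sketch of how such an RSE can be computed from the residuals, again with invented data and numpy assumed:

```python
import numpy as np

# Made-up observations with some scatter around a linear trend
distance = np.array([100, 400, 800, 1500, 2500, 4000], dtype=float)
richness = np.array([14, 12, 18, 16, 24, 21], dtype=float)

a, b = np.polyfit(distance, richness, 1)
residuals = richness - (a * distance + b)

# Residual standard error: sqrt(SS_res / (n - 2)); the "- 2" accounts
# for the two estimated coefficients (slope and intercept). The result
# is in the same unit as y, here number of species.
n = len(richness)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(f"RSE = {rse:.2f} species")
```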
Recall that the standard error is used for calculating confidence intervals, which you can plot as well. They make it easier to see whether the model performs equally well across the data. That is the case for the logarithmic model, but not for the linear model, where the confidence interval widens at low and high distances.
All visual clues and calculated metrics point to the logarithmic model as the better fit for your data.
7. p-value
A final useful model metric is the p-value. Informally, the p-value of a model tells you how likely it is that you would see a relationship this strong if there were in fact no relationship between the two variables. If the p-value is less than 0 point 05, we say that the model is statistically significant, and you can assume that there is in fact a relationship. Keep in mind that a low p-value is evidence for a relationship; it does not by itself mean the model fits the data well.
Note that the p-value alone doesn't tell you the whole story. Always check a visualization of your data and the other model metrics as well. Both models here are significant, but the logarithmic one is clearly the better fit.
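As a sketch of where such a p-value comes from, scipy's linregress reports one alongside the slope for a simple linear fit. The data below is invented, and scipy is assumed to be installed:

```python
import numpy as np
from scipy.stats import linregress

# Invented data with a clear positive relationship
distance = np.array([100, 400, 800, 1500, 2500, 4000], dtype=float)
richness = np.array([14, 15, 17, 19, 23, 28], dtype=float)

result = linregress(distance, richness)

# result.pvalue: probability of a slope at least this far from zero
# if there were in fact no relationship between the variables
print(f"slope = {result.slope:.4f}")
print(f"p-value = {result.pvalue:.5f}")
print("significant at 0.05:", result.pvalue < 0.05)
```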
8. Let's practice!
Let's practice assessing model performance.