1. Linear regression
In the last chapter,
2. Quantifying trends
you learned to visualize the trend of the “% yes” metric over time for individual countries, and saw that Afghanistan’s agreement has generally been going up while the United States’ has been going down. However, while it’s easy to recognize this trend visually, we haven’t yet quantified it. In this chapter, we’re going to learn to model this trend with a linear regression,
3. Linear regression
finding a “best fit” line for each country. For example, here we can see that
4. Linear regression
Afghanistan has a positive slope
5. Linear regression
and the US a negative slope.
6. Fitting model to Afghanistan
First, you can use filter to extract the per-year data for one country, in this case Afghanistan, into its own data frame.
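As a rough sketch of that filtering step: the data frame name `by_year_country` and its values below are made up for illustration (the course's real dataset is not reproduced here), and `filter` is assumed to be the dplyr function.

```r
library(dplyr)

# Hypothetical toy data: one row per year and country.
# The name by_year_country and all values are illustrative assumptions.
by_year_country <- data.frame(
  year = rep(c(1947, 1957, 1967), times = 2),
  country = rep(c("Afghanistan", "United States"), each = 3),
  percent_yes = c(0.40, 0.55, 0.70, 0.70, 0.55, 0.40)
)

# Keep only the rows for one country, stored in its own data frame
afghanistan <- by_year_country %>%
  filter(country == "Afghanistan")

print(afghanistan)
```

The result has the same columns as the input, just fewer rows, so it can be passed straight to a modeling function.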
7. Fitting model to Afghanistan
You can then use the lm function, short for “linear model”, to fit the line. We describe the model as “percent yes, tilde, year.”
8. Fitting model to Afghanistan
Percent yes is our dependent variable, on the y-axis. Next is the tilde: in R this means “explained by”. Then we have “year”,
9. Fitting model to Afghanistan
the independent variable, on the x-axis. This says we’re modeling “percent yes explained by year.”
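The formula described above might look like this in code. This is a minimal sketch: the `afghanistan` data frame here is a small made-up stand-in, not the real per-year values.

```r
# Hypothetical toy data standing in for Afghanistan's per-year values
afghanistan <- data.frame(
  year = c(1947, 1957, 1967, 1977, 1987),
  percent_yes = c(0.40, 0.47, 0.52, 0.58, 0.66)
)

# percent_yes (dependent, y-axis) explained by year (independent, x-axis)
model <- lm(percent_yes ~ year, data = afghanistan)

# The fitted y-intercept and slope of the best-fit line
print(coef(model))
```

The variable on the left of the tilde is always the one being explained; swapping the two sides would fit a different line.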
10. Fitting model to Afghanistan
We can examine this model using the summary function, run on the model object we created with lm. There’s a lot of output, and if you have experience in R you may recognize some of it, but we’re going to focus on the coefficient table in the middle. Each row here represents a term that’s been estimated: a y-intercept and a slope. The term we’re most interested in is the year term, also known as the slope, showing how much percent_yes changes with each additional year. First we have the estimated slope. In R, e-3 is scientific notation, meaning “times 10 to the negative three”; this makes the slope 0.006. This describes a positive slope: an increase of 0.6 percentage points in % yes each year. We may also care about the p-value, which tests for statistical significance. We won’t talk much about the details of p-values in this course, but low p-values, such as this one, generally mean the trend is unlikely to be due to chance alone.
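To make the coefficient table concrete, here is a small sketch that pulls the slope and p-value out of the summary. The data are made-up stand-ins, and treating percent_yes as a proportion between 0 and 1 is an assumption inferred from the slope described above.

```r
# Hypothetical toy data standing in for Afghanistan's per-year values
afghanistan <- data.frame(
  year = c(1947, 1957, 1967, 1977, 1987),
  percent_yes = c(0.40, 0.47, 0.52, 0.58, 0.66)
)

model <- lm(percent_yes ~ year, data = afghanistan)
model_summary <- summary(model)

# Coefficient table: one row per estimated term (intercept and year),
# with columns for the estimate, standard error, t value, and p-value
coef_table <- coef(model_summary)
print(coef_table)

# Pull out the slope estimate and its p-value by row and column name
slope <- coef_table["year", "Estimate"]
p_value <- coef_table["year", "Pr(>|t|)"]

# Scientific notation: 6e-03 is just another way of writing 0.006
print(isTRUE(all.equal(6e-03, 0.006)))

# Assuming percent_yes is a proportion, multiply by 100 to read the
# slope as percentage points per year
cat("Slope in percentage points per year:", slope * 100, "\n")
cat("p-value for the year term:", p_value, "\n")
```

Indexing the table by name rather than by position keeps the code readable and avoids picking up the intercept row by mistake.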
Quantifying the trend is important,
11. Visualization can surprise you, but it doesn’t scale well.
because in the words of Hadley Wickham, “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you.” Now that you’ve visualized a few examples and know what you’re looking for, you can apply a model. In the course of this chapter we’ll learn to “scale” this analysis
12. Let's practice!
to compare all countries in our dataset at once.