1. Explaining house price with size & condition
Previously, you created a multiple regression model for house price using two numerical explanatory (predictor) variables. However, multiple regression is not limited to combinations of numerical variables; you can also use categorical variables!
In this video, you'll once again model log10_price as a function of log10_size, but now also consider the house's condition, a categorical variable with 5 levels, as seen in Chapter 1.
2. Refresher: Exploratory data analysis
After creating the transformed log10_price and log10_size variables, let's review the EDA we performed in Chapter 1 of the relationship between log10_price and condition.
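As a quick reminder of what the transformation does, here is a minimal Python sketch. The course itself works in R, and the three houses below are made-up values for illustration, not rows from the dataset.

```python
import math

# Hypothetical (price, size) pairs, invented for illustration only.
houses = [(100_000, 1_000), (1_000_000, 10_000), (450_000, 2_000)]

# Apply the log10 transformation to both variables.
transformed = [(math.log10(price), math.log10(size)) for price, size in houses]

for (price, size), (log10_price, log10_size) in zip(houses, transformed):
    print(f"price={price:>9,}  log10_price={log10_price:.3f}  "
          f"size={size:>6,}  log10_size={log10_size:.3f}")
```

A tenfold increase in price or size corresponds to an increase of exactly one on the log10 scale, which is worth keeping in mind when reading slopes later on.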
3. Refresher: Exploratory data analysis
You previously saw: a roughly increasing trend in the mean as condition goes from 1 to 5, variation within each of the five levels of condition, and the fact that most houses are of conditions 3, 4, or 5. Let's continue this EDA with some more exploratory visualizations.
4. House price, size, and condition
Here you plot all 21k points in a colored scatterplot, where
- x maps to log10_size,
- y maps to log10_price, and
- the colors of the points map to condition.
You also plot the overall regression line in black, in other words, the regression line for all houses irrespective of condition.
However, wouldn't it be nice to have a separate regression line for each color, in other words, for each condition level? This would allow you to consider the relationship between size and price separately for each condition. Let's do this!
5. Parallel slopes model
Here's the same colored scatterplot, but now with five separate regression lines. Note that we're keeping things simple by giving all 5 lines the same slope while allowing different intercepts. Observe that houses of condition 5 have the highest regression line, followed by 4, 3, 1, and then 2. This is known as the "parallel slopes" model.
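Written as an equation, the parallel slopes model has one shared slope and one intercept offset per non-baseline condition level. The symbols b_0, b_size, and b_k are notation introduced here for illustration, and treating condition 1 as the baseline level is an assumption at this point:

```latex
\widehat{\log_{10}(\text{price})}
  = b_0
  + b_{\text{size}} \cdot \log_{10}(\text{size})
  + \sum_{k=2}^{5} b_k \cdot \mathbb{1}[\text{condition} = k]
```

Because b_size does not depend on k, the five lines are parallel: the baseline line has intercept b_0, and the line for condition k has intercept b_0 + b_k.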
6. Parallel slopes model
An alternative visualization is one split by facets. This plot really brings to light that there are very few houses of condition 1 or 2. However, comparing the 5 regression lines here is harder than before. Which plot is better? Again, there is no universal right answer: you need to make a choice depending on what you want to convey, and own it.
7. House price, size, and condition relationship
Let's explicitly quantify these relationships by looking at the regression table. You once again fit a regression using a formula where the plus sign separates the two explanatory variables, and then apply the get_regression_table() function to the fitted model.
Recall the notion of a "baseline for comparison" level when using a categorical variable in a regression model. In this case, the baseline group is houses of condition = 1. Let's interpret the terms.
The fitted intercept of 2.88 corresponds to the intercept of the baseline group, that is, the condition 1 houses shown in red in the previous plot.
The fitted slope of 0.837 for log10_size corresponds to the associated average increase in log10_price for every one-unit increase in log10_size. Recall that the parallel slopes model dictates that all 5 groups share this same slope.
condition2 = -0.0385 is the difference in intercept, or the offset, for condition 2 houses relative to condition 1 houses, a negative value. This is reflected in the previous plot by the fact that the yellow regression line has a lower intercept than the red line.
Conditions 3, 4, and 5 are interpreted similarly. Observe that this offset is largest for condition 5 at +0.0956. This is reflected in the previous plot by the purple regression line having the highest intercept.
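To make the interpretation concrete, the sketch below plugs the coefficients quoted above into the parallel slopes equation. It is plain Python arithmetic rather than the course's R workflow, the function name fitted_log10_price and the example house are hypothetical, and the offsets for conditions 3 and 4 are left out because the text does not quote them.

```python
# Coefficients quoted from the regression table in the text.
INTERCEPT = 2.88          # intercept of the baseline group (condition 1)
SLOPE_LOG10_SIZE = 0.837  # slope shared by all condition levels

# Intercept offsets relative to condition 1.
# Conditions 3 and 4 are omitted: their values are not quoted in the text.
OFFSETS = {1: 0.0, 2: -0.0385, 5: 0.0956}


def fitted_log10_price(log10_size: float, condition: int) -> float:
    """Fitted log10_price under the parallel slopes model (hypothetical helper)."""
    return INTERCEPT + OFFSETS[condition] + SLOPE_LOG10_SIZE * log10_size


log10_size = 3.0  # hypothetical house whose size is 10**3 in the data's units

for condition in sorted(OFFSETS):
    group_intercept = INTERCEPT + OFFSETS[condition]
    fitted = fitted_log10_price(log10_size, condition)
    print(f"condition {condition}: intercept={group_intercept:.4f}, "
          f"fitted log10_price={fitted:.4f}, fitted price={10 ** fitted:,.0f}")

# On the original price scale, an offset in log10_price acts as a
# multiplicative factor rather than an additive one.
ratio_5_vs_1 = 10 ** OFFSETS[5]
print(f"condition 5 vs condition 1 fitted price ratio: {ratio_5_vs_1:.3f}")
```

At any fixed log10_size, the gap between two groups' fitted values equals the difference of their offsets, which is exactly why the lines never cross.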
8. Let's practice!
Your turn! You'll model price using the same numerical variable, but with a different categorical variable: whether the house had a view of the waterfront.