Explaining house price with year & size

1. Explaining house price with year & size

In Chapter 2, you created your first basic regression models that incorporated one explanatory/predictor variable x at a time. In this chapter on *multiple* regression, we now consider regression models that incorporate *more* than one x variable. You'll use the Seattle house-prices dataset we introduced in Chapter 1.

2. Refresher: Seattle house prices

Recall that you performed EDAs of the relationship of price with variables like: sqft_living, a measure of a house's size; the condition of the house, a categorical variable with 5 levels; and whether the house had a view of the waterfront.

3. Refresher: Price and size variables

You also saw that the outcome variable price was right-skewed, as evidenced by the long right tail in the left histogram. This skew was caused by a small number of very expensive houses, resulting in a plot where it's difficult to compare prices of the less expensive houses. This was also the case for the explanatory variable sqft_living on the right.

4. Refresher: log10 transformation

You unskewed both variables using a log10-transformation, rendering them in this case both more symmetric and bell-shaped. Also, recall that a log10-price of 6 corresponds to a price of 10^6 = 1 million.
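As a quick check of the arithmetic above, base R's log10() function and the ^ operator convert between the two scales. The variable names below are illustrative only.

```r
# A log10-price of 6 back-transforms to a price of 10^6, i.e. 1 million
log10_price <- 6
price <- 10^log10_price
price

# Going the other way: a price of 1 million has a log10-price of 6
log10(1e6)
```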

5. Refresher: Data transformation

The dplyr mutate() code that created the log10-transformed variables is shown here. For the rest of this course, you'll assume that this code was run, hence both the variables log10_price and log10_size exist in house_prices.
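A minimal sketch of that transformation might look like the following. It assumes house_prices is the data frame shipped with the moderndive package (the transcript does not name the package) and that its price and sqft_living columns hold the raw values.

```r
library(dplyr)
library(moderndive)  # assumption: source of the house_prices data frame

# Add log10-transformed versions of price and size as new columns
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

# Peek at the original and transformed columns side by side
house_prices %>%
  select(price, log10_price, sqft_living, log10_size) %>%
  head()
```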

6. Model for house price

Let's explore the relationship between price and two explanatory variables: house size and year built. As always before any modeling, let's perform an EDA. While a scatterplot displays the relationship between two numerical variables, here you have three numerical variables. How can you visualize their joint relationship? By using a 3D scatterplot!

7. Exploratory visualizing of house price, size & year

Here's a display of the 3D scatterplot for 500 randomly chosen houses. The outcome variable log10-price is on the vertical axis, while the two explanatory variables are on the bottom grid. Now, how can you visually summarize the relationship between these points? In Chapter 2, when we had a 2D scatterplot, we used a regression line. The generalization of a regression line in a 3D scatterplot is a regression plane!
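For the curious, a 3D scatterplot of this kind could be sketched roughly as follows. This is not necessarily how the plot on the slide was made: it assumes the plotly package, the house_prices data from the moderndive package, and a year-built column named yr_built. The seed and marker size are arbitrary choices.

```r
library(dplyr)
library(moderndive)  # assumption: source of the house_prices data frame
library(plotly)

set.seed(2018)  # arbitrary seed so the random sample is reproducible

# Transform the variables, then draw 500 houses at random
houses_sample <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  ) %>%
  sample_n(500)

# Explanatory variables on the bottom grid (x, y), outcome on the vertical axis (z)
scatter_3d <- plot_ly(
  data = houses_sample,
  x = ~log10_size,
  y = ~yr_built,
  z = ~log10_price,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 2)
)
scatter_3d
```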

8. Regression plane

Here's a snapshot of the corresponding regression plane that cuts through the cloud of points and "best fits" them. Unfortunately, this snapshot is non-interactive. For an interactive version, click on the above link. These 3D visualizations were created with the plotly package, which is a topic that would take too long to cover in this course, so our exercises won't involve creating such plots. How can you quantify the relationships between these variables? By obtaining the values of the fitted regression plane!

9. Regression table

As in Chapter 2, you fit the model using the lm() function, but now with a model formula of the form y ~ x1 + x2 (read "y tilde x1 plus x2"), where the plus indicates you are using more than one explanatory variable. You get the regression table as before. The intercept here has no practical meaning, in particular since there are no houses in Seattle built in the year 0! The first slope coefficient suggests that, "taking into account all other variables", increases of 1 in log10_size are associated with increases of on average 0.913 in log10_price. In other words, taking into account the age of the home, larger homes tend to cost more. Note that you preface your statement with "taking into account all other variables" since you are now jointly interpreting the associated effect of *multiple* explanatory variables in the same model. Similarly, taking into account, or "controlling for", log10_size, for every additional year in recency of construction, there is an associated decrease of on average 0.00138 in log10_price (a slope of -0.00138), suggesting that, taking into account house size, newer houses tend to cost less.
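The fit described above can be sketched as follows. The sketch assumes house_prices and get_regression_table() come from the moderndive package and that the year-built column is named yr_built. The model name model_price_1 is just a label chosen for illustration.

```r
library(dplyr)
library(moderndive)  # assumption: provides house_prices and get_regression_table()

# Recreate the log10-transformed variables
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

# Fit the multiple regression: the + adds a second explanatory variable
model_price_1 <- lm(log10_price ~ log10_size + yr_built, data = house_prices)

# Regression table with the intercept and the two slope coefficients
get_regression_table(model_price_1)

# Base R alternative: just the fitted coefficients
coef(model_price_1)

# On the original price scale, the log10_size slope is a multiplicative factor:
# the change in price associated with a tenfold increase in size,
# taking year built into account
10^coef(model_price_1)["log10_size"]
```

Because both price and size are on a log10 scale, the last line is one way to translate the slope back into dollars: it gives a ratio of prices rather than a difference.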

10. Let's practice!

Your turn! You'll create your first multiple regression model for price, using year built and the number of bedrooms as explanatory variables.