

1. Background on modeling for prediction

Previously, you were introduced to an example of modeling for explanation: understanding what factors might explain teacher evaluation scores as given by students at the University of Texas at Austin. Let's now consider an example of modeling for prediction.

2. Modeling for prediction example

The dataset I'll use is "House Sales in King County, USA", available on Kaggle.com. It consists of homes sold near Seattle, Washington in 2014 and 2015. I'll predict the sale price of houses based on features such as: size, as measured by square feet of living space (1 square foot is approximately 1/10th of a square meter); condition; number of bedrooms; year built; and whether the house had a view of the waterfront.

3. Modeling for prediction example

Just as with the evals dataset, I've included this data in the moderndive package, which I preview using the glimpse() function. Observe. There are 21k rows representing houses and 21 variables… As before, let's perform an exploratory data analysis, or EDA, of the outcome variable, price. Recall the three approaches to EDA: looking at the data, visualizations, and summary statistics. Since I've just done the first approach, let's now visualize our data.
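
In code, that preview might look like the following minimal sketch; it assumes the data are exposed as house_prices in the moderndive package and uses glimpse() from dplyr:

    library(moderndive)    # assumed to contain the house_prices data frame
    library(dplyr)         # provides glimpse()

    glimpse(house_prices)  # roughly 21,000 rows (one per house) and 21 variables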

4. Exploratory data analysis

Just as with the outcome variable score from our explanatory modeling example, let's get a sense of the distribution of our new numerical outcome variable, price, using a histogram.
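
One way to build such a histogram with ggplot2 is sketched below; the axis labels are illustrative choices rather than the course's exact code:

    library(ggplot2)

    ggplot(house_prices, aes(x = price)) +
      geom_histogram() +
      labs(x = "price (USD)", y = "count")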

5. Histogram of outcome variable

First, let's look at the x-axis tick marks. Since e+06 means 10^6, or one million, we see that the majority of houses cost less than 2 million dollars. But why does the x-axis stretch so far to the right? It's because there is a very small number of houses with prices closer to 8 million. We say that the variable price is "right-skewed," as exhibited by the long right tail. This skew makes it difficult to compare the prices of the less expensive houses. Recall that you saw something similar in the Introduction to the Tidyverse course when visualizing country populations from the gapminder dataset.

6. Gapminder data

You visualized the relationship between countries' life expectancies and populations using a scatterplot similar to this one. Because the populations of the two green dots corresponding to India and China were so large, it was hard to study the relationship for less populated countries. To remedy this, you re-scaled the x-axis to be on a log10-scale.
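
A rough sketch of that rescaling, assuming the gapminder data frame from the gapminder package (the exact year plotted is an assumption here, not something stated in this lesson):

    library(ggplot2)
    library(gapminder)   # assumed source of the gapminder data frame
    library(dplyr)

    gapminder %>%
      filter(year == 2007) %>%              # one point per country; the year is an assumption
      ggplot(aes(x = pop, y = lifeExp)) +
      geom_point() +
      scale_x_log10()                       # rescale the x-axis to a log10 scale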

7. Log10 rescaling of x-axis

Now you can better distinguish points for countries with smaller populations. Furthermore, horizontal intervals on the x-axis now correspond to multiplicative differences instead of additive ones. For example, distances between successive vertical white gridlines correspond to multiplicative increases by a factor of 10.

8. Log10 transformation

Now I'll do something similar, log10-transforming price using the mutate() function to create a new variable, log10_price. Let's view the effects of this transformation on these two variables. Observe in particular the house in the 6th row with a price of 1.225 million dollars. Since 10^6 is one million, its log10_price is 6.09. Contrast this with the other houses shown, which have log10_price values less than 6 (and hence prices below one million). I'll treat log10_price as our new outcome variable. I can do this because log transformations are monotonic, meaning they preserve orderings: if house A's price is lower than house B's, then house A's log10_price will also be lower than house B's log10_price.
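
A minimal sketch of that transformation, plus a quick check of the sixth row's value:

    library(dplyr)

    house_prices <- house_prices %>%
      mutate(log10_price = log10(price))

    # View the effect of the transformation on the two variables
    house_prices %>%
      select(price, log10_price)

    log10(1225000)   # approximately 6.09, as for the 1.225-million-dollar house in row 6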

9. Histogram of new outcome variable

Let's take the earlier code that plotted the histogram of the ORIGINAL outcome variable, then copy, paste, and tweak it to plot a histogram of the NEW log10-transformed outcome variable.
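
The tweak is small: only the variable mapped to x (and its label) changes. A sketch:

    ggplot(house_prices, aes(x = log10_price)) +
      geom_histogram() +
      labs(x = "log10 of price (USD)", y = "count")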

10. Comparing before and after log10-transformation

Observe that after the transformation, the distribution is much less skewed, and in this case more symmetric and bell-shaped, although that isn't always the case. You can now better distinguish between houses at the lower end of the price scale.

11. Let's practice!

Now that you've seen that a log10 transformation was warranted for the outcome variable price, let's see if the predictor variable size, as measured by square feet of living space, warrants a similar transformation.
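
As a hint of what that check might look like (assuming the living-space variable is named sqft_living, as in the Kaggle data), you could plot its histogram and look for a similar right skew:

    ggplot(house_prices, aes(x = sqft_living)) +
      geom_histogram() +
      labs(x = "living space (square feet)", y = "count")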
