1. The modeling problem for prediction
You'll finish this first chapter by studying the modeling problem for prediction. While the mechanics of predictive modeling are similar to those of explanatory modeling, you'll see there are some subtle differences in the goals.
2. Modeling problem
Recall the modeling-for-explanation problem: the f and epsilon components are unknown, and you only observe the y and x components. Based on y and x, you fit a model f-hat that hopefully closely approximates the true f while ignoring the error epsilon. In other words, you want the fitted model to separate the signal from the noise. You then use this fitted model to obtain fitted/predicted values of y, called y-hat.
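Written out in the notation just described, the setup is:

```
y = f(x) + epsilon      # the true but unknown relationship, plus error
y-hat = f-hat(x)        # fitted/predicted values from the fitted model
```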
3. Difference between explanation and prediction
In explanatory modeling, the form of f-hat matters greatly, in particular any values that quantify the relationship between y and x. For example, for every increase of 1 in instructor age, what is the typical associated change in teaching score?
However, in predictive modeling, you don't care so much about the form of f-hat, but rather that it yields good predictions. So, if you give inputs x to f-hat, can you get a prediction y-hat that is close to the true value of y?
Let's build our intuition about predictive modeling through a further EDA of house prices. However, instead of using a numerical explanatory variable, let's use a categorical predictor variable, house condition.
4. Condition of house
Let's glimpse() just the variables price and condition. Condition is a categorical variable with 5 levels, where 1 indicates poor and 5 indicates excellent. Note that while condition takes values between 1 and 5, these values are represented in R as fct, or factors, so they are treated as categorical.
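A minimal sketch of this step, assuming the data is in a data frame called house_prices (the name is an assumption) and the dplyr package is loaded:

```r
library(dplyr)

# glimpse() just the price and condition columns;
# condition should display as <fct>, i.e. a factor
house_prices %>%
  select(price, condition) %>%
  glimpse()
```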
5. Exploratory data visualization: boxplot
Since the original price variable was right-skewed, recall that you applied a log10-transformation to unskew it.
Now, how can you visualize the relationship between the numerical outcome variable log10-price and the categorical variable condition?
Using a boxplot! You use geom_boxplot(), where x maps to condition and y maps to log10-price.
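A sketch of this visualization, assuming a data frame called house_prices (the name is an assumption) and the dplyr and ggplot2 packages:

```r
library(dplyr)
library(ggplot2)

# Create the log10-transformed price, then draw one boxplot per condition level
house_prices %>%
  mutate(log10_price = log10(price)) %>%
  ggplot(aes(x = condition, y = log10_price)) +
  geom_boxplot() +
  labs(x = "house condition", y = "log10 price")
```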
6. Exploratory data visualization: boxplot
Observe.
For each condition level, the 25th and 75th percentiles are marked by the ends of the boxes, while the medians are marked with solid horizontal lines. You also observe outliers.
As the house condition goes up, as expected, there is a corresponding increasing trend in the median log10-price. Furthermore, within each condition level there is variation in log10-price, as evidenced by the lengths of the boxes.
Let's now also summarize each group by computing its mean. While both the median and the mean are measures of center, means are at the heart of the modeling techniques we'll cover in the next chapter.
7. Exploratory data summaries
Recall from earlier courses that to obtain summaries of log10-price split by condition, you first group_by() condition, and then summarize(). You summarize() using the mean of log10-price, the standard deviation, and the sample size via the n() function, which simply counts the number of rows in each condition level.
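The steps just described can be sketched as follows, again assuming a data frame called house_prices (the name is an assumption) and the dplyr package:

```r
library(dplyr)

# Group-level summaries of log10-price for each condition level:
# the mean, the standard deviation, and the number of houses
house_prices %>%
  mutate(log10_price = log10(price)) %>%
  group_by(condition) %>%
  summarize(
    mean = mean(log10_price),
    sd   = sd(log10_price),
    n    = n()
  )
```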
Observe. The group-level means exhibit a similar increasing pattern as the medians in the boxplot from before. There is also variation within each condition level, as quantified by the standard deviation. Lastly, most houses are of condition 3, 4, or 5.
Let's start predicting. Say a house is put on the Seattle market and you only know that its condition is 4. A reasonable prediction of its log10-price is the group mean for the condition 4 houses: 5.65.
8. Exploratory data summaries
To obtain this prediction in dollars, however, you undo the log10-transformation by raising 10 to the power 5.65, yielding a predicted sale price of about $446,000.
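This back-transformation is a one-liner in R:

```r
# Group mean of log10-price for condition 4 houses (from the summary above)
log10_price_hat <- 5.65

# Undo the log10-transformation to get a prediction in dollars
price_hat <- 10^log10_price_hat
price_hat  # roughly 446,684 dollars, i.e. about $446k
```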
But given the variation in prices within the condition 4 group, not all houses of condition 4 are exactly $446k. In other words, this prediction is bound to have some error.
Using our earlier terminology, the value 5.65 can be thought of as the "signal" and any difference between this prediction and the actual log10-price can be thought of as the "noise".
9. Let's practice!
Let's close out this introductory chapter with further exercises on exploratory data analysis for modeling for prediction.