
Making predictions

1. Making predictions

Welcome back! Having inspected the posterior draws and verified the model was fitted successfully, we can use the model to make predictions for new data!

2. Number-of-clicks model again

Let's take a look at our number-of-clicks model once again as a reminder.

3. Ads test data

Now we have some new data in a variable named ads_test: five new observations which were not in the data we used to fit the model. For each of these new observations, we know the true number of clicks, so we can compare it with the model's predictions to see how accurate they are.

4. Sampling predictive draws

To obtain predictions for the new data, we start with a with statement calling pm-dot-Model, just like we did when generating the trace. Next, inside the with statement, we call pm-dot-GLM-dot-from_formula just as before. We pass the same model formula, but this time, we pass the new test data to the data argument. Finally, we call pm-dot-fast_sample_posterior_predictive on the trace generated by our model. Now the posterior_predictive variable holds the predictive distributions for our test data. Let's take a look inside!
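In code, this step might look like the following sketch; the formula string and the trace variable are assumed to come from the earlier fitting step, and the column name num_clicks is a hypothetical name for the response.

```python
import pymc3 as pm

# Hypothetical formula reused from the fitting step; the column names are assumptions
formula = "num_clicks ~ spent_on_ads"

with pm.Model() as model:
    # Re-create the same GLM, but feed it the new test data
    pm.GLM.from_formula(formula, data=ads_test)
    # Sample predictive draws, reusing the posterior captured in `trace`
    posterior_predictive = pm.fast_sample_posterior_predictive(trace)
```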

5. Predictive draws

PyMC3 denotes the response variable of a regression model as "y", so we need to extract it from posterior_predictive. Printing its shape and contents, we see that it's a NumPy array of 4000 rows and five columns. The five columns correspond to the five observations in our test data for which the predictions have been generated. In each column, that is, for each test observation, there are 4000 predictive draws, which matches the settings we chose when fitting the model: 1000 draws across 4 chains.
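As a quick check, the draws can be inspected like this, assuming the posterior_predictive variable from the previous step:

```python
# Extract the predictive draws for the response variable, which PyMC3's GLM names "y"
y_pred = posterior_predictive["y"]

print(y_pred.shape)  # (4000, 5): 1000 draws * 4 chains, one column per test observation
print(y_pred)
```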

6. How good is the prediction?

Let's evaluate these predictions! This is our first test observation, with the true number of clicks equal to seven. And this is our corresponding prediction: we take the "y" from posterior_predictive and slice it to get only the first column, corresponding to the first test observation. Let's pass it to plot_posterior. The posterior mean amounts to 9.7, and we are 94% sure it's between 4.4 and 15. Not that bad, is it? This is the prediction for just a single test observation, though. It would be much more useful to estimate the model's error in general, based on many test examples. Let's see how to do this!
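A minimal sketch of this slicing and plotting, again assuming posterior_predictive from the sampling step:

```python
import pymc3 as pm

# Predictive draws for the first test observation (first column)
first_prediction = posterior_predictive["y"][:, 0]

# Plot the predictive distribution with its mean and 94% credible interval
pm.plot_posterior(first_prediction)
```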

7. Test error distribution

We start by creating an empty list called errors to store the errors for subsequent test observations. Then, we iterate over the rows of ads_test using the iterrows method. Here, test_example is a row of the DataFrame, and index is its index, from zero to four. For each row, we calculate the error as follows: we slice posterior_predictive just like before to take only the column corresponding to the index, and we subtract the true number of clicks from it. The first part is an array of 4000 predictive draws, and from each draw we subtract the true value to get the error distribution. Next, we append this error distribution to our list of errors. Now, errors is a list containing five arrays of 4000 values each. To get a general error distribution, we have to gather all these numbers into a single array. We can do so by passing errors to np-dot-array and reshaping it using the reshape method with -1 as the argument. This way, we get an array of 20,000 values, which we can pass to pm-dot-plot_posterior.
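Put together, the loop might look like this sketch; the response column name num_clicks in ads_test is an assumption:

```python
import numpy as np
import pymc3 as pm

errors = []

# One iteration per test observation; `index` runs from 0 to 4
for index, test_example in ads_test.iterrows():
    # 4000 predictive draws for this observation minus the true value
    error = posterior_predictive["y"][:, index] - test_example["num_clicks"]
    errors.append(error)

# Gather the five arrays of 4000 errors into one array of 20,000 values
error_distribution = np.array(errors).reshape(-1)

pm.plot_posterior(error_distribution)
```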

8. Test error distribution

On average, our predictions are off by one point five clicks, and we are more likely to predict too many clicks than too few.

9. Let's make predictions!

Let's make predictions!