1. Predicting parallel slopes
Predicting responses is perhaps the most useful feature of regression models. With two explanatory variables, the code for prediction has one subtle difference from the case with a single explanatory variable.
2. The prediction workflow
The prediction workflow starts with choosing values for explanatory variables. You pick any values you want, and store them in a pandas DataFrame.
For a single explanatory variable, the DataFrame has one column. Here, it's a range of lengths from 5cm to 60cm, in steps of 5cm. You can reuse the np dot arange function from the previous course to specify this. Notice that the stop argument of np dot arange is exclusive, so to include 60 you pass a stop value just beyond it.
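A minimal sketch of this step, assuming the length column is named length_cm as in the fish dataset:

```python
import numpy as np
import pandas as pd

# Lengths from 5cm to 60cm in steps of 5cm.
# np.arange excludes the stop value, so pass 61 to include 60.
explanatory_data = pd.DataFrame({"length_cm": np.arange(5, 61, 5)})
```

This yields twelve rows, one per candidate length.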
3. The prediction workflow
For multiple explanatory variables, you need to define multiple columns in your explanatory DataFrame.
Say, for example, that you would like to create a DataFrame that holds all combinations of A, B, and C, and the numbers 1 and 2. You could manually create such a DataFrame, but this would be cumbersome and not scalable for more than two explanatory variables.
A useful trick to create such a DataFrame is to use the product function from the itertools module. The product function returns a Cartesian product of your input variables. In other words, it outputs all combinations of its inputs. Let's apply this to our fish dataset.
You first create your explanatory variable lists.
For a categorical variable, we use pandas' unique method. This method extracts the unique values of your categorical variable.
The product function then creates a combination of all of the elements of these input lists.
Lastly, you transform the output of the product function into a pandas DataFrame, and name the columns.
Here, you have 5cm and each fish species, 10cm and each fish species, all the way to 60cm and each fish species.
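The workflow just described can be sketched as follows. The species names are written out directly here; on the real dataset you would obtain them with something like fish["species"].unique(), and the column names are assumptions:

```python
from itertools import product

import numpy as np
import pandas as pd

lengths = np.arange(5, 61, 5)
# On the fish dataset these would come from fish["species"].unique()
species = ["Bream", "Perch", "Pike", "Roach"]

# product() yields every (length, species) combination
combos = list(product(lengths, species))

# Turn the combinations into a DataFrame and name the columns
explanatory_data = pd.DataFrame(combos, columns=["length_cm", "species"])
```

With 12 lengths and 4 species, the result has 48 rows: 5cm paired with each species, then 10cm with each species, and so on.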
4. The prediction workflow
Next you add a column of predictions to the DataFrame. To calculate the predictions, start with the explanatory DataFrame, call assign, name the response variable, and use the predict method on the model, passing the explanatory data as the argument. Here's the code for one explanatory variable.
With two or more explanatory variables, other than the variable naming, the code is exactly the same! Notice the different number of rows between the two outputs.
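Here is a self-contained sketch of the two-variable case, using a small synthetic stand-in for the fish dataset (the column names and species are assumptions, and the model is fit with statsmodels' formula API):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the fish dataset
rng = np.random.default_rng(42)
fish = pd.DataFrame({
    "length_cm": np.tile(np.arange(10, 50, 5), 4),
    "species": np.repeat(["Bream", "Perch", "Pike", "Roach"], 8),
})
fish["mass_g"] = 20 * fish["length_cm"] - 300 + rng.normal(0, 50, len(fish))

# Parallel slopes model: one shared slope, one intercept per species
mdl = ols("mass_g ~ length_cm + species", data=fish).fit()

# All 48 combinations of length and species
explanatory_data = pd.DataFrame({
    "length_cm": np.tile(np.arange(5, 61, 5), 4),
    "species": np.repeat(["Bream", "Perch", "Pike", "Roach"], 12),
})

# assign() adds the predictions as a new column
prediction_data = explanatory_data.assign(
    mass_g=mdl.predict(explanatory_data)
)
```

Apart from the extra column in explanatory_data, the assign-and-predict line is identical to the single-variable case.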
5. Visualizing the predictions
We can visualize the predictions from the model by adding another scatter plot, setting the data argument to prediction_data. I also set the color argument to black to distinguish the predictions from the actual data points.
Notice how the black prediction points lie on the trend lines.
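A sketch of that plot with seaborn, using small hypothetical DataFrames in place of the real data and predictions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical actual data and model predictions
fish = pd.DataFrame({"length_cm": [20, 30, 40],
                     "mass_g": [150, 350, 600],
                     "species": ["Bream", "Bream", "Bream"]})
prediction_data = pd.DataFrame({"length_cm": [10, 25, 50],
                                "mass_g": [50.0, 275.0, 800.0],
                                "species": ["Bream", "Bream", "Bream"]})

fig, ax = plt.subplots()
sns.scatterplot(x="length_cm", y="mass_g", data=fish, ax=ax)
# Predictions in black to distinguish them from the data points
sns.scatterplot(x="length_cm", y="mass_g", data=prediction_data,
                color="black", ax=ax)
```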
6. Manually calculating predictions for linear regression
In the previous course, you saw how to manually calculate the predictions for linear regression. The params attribute contains the coefficients from the model.
The intercept is the first coefficient, and the slope is the second coefficient.
Then the response value is the intercept plus the slope times the explanatory variable.
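As a sketch, with hypothetical coefficient values standing in for what mdl.params would return:

```python
import numpy as np
import pandas as pd

# Assumed values for illustration; in practice these come from mdl.params
intercept = -536.2
slope = 34.9

explanatory_data = pd.DataFrame({"length_cm": np.arange(5, 61, 5)})

# Response = intercept + slope * explanatory variable
prediction_data = explanatory_data.assign(
    mass_g=intercept + slope * explanatory_data["length_cm"]
)
```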
7. Manually calculating predictions for multiple regression
For the parallel slopes model, we already saw that each category of the categorical variable has a different intercept.
This means that, to calculate predictions, you would have to choose the intercept using if-else statements. This becomes clunky when you have lots of categories.
8. np.select()
NumPy has a function called select that simplifies the process of getting values based on conditions. np dot select takes two arguments: a list of conditions, and a list of choices. Both lists have to be of the same length. You can read it as: 'If condition 1 is met, take the first element in choices, if condition 2 is met, take the second element in choices', and so on.
The output is an array drawn from the elements in choices, depending on conditions.
This is very abstract, so let's look at how we use it for predictions.
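Here is a tiny standalone illustration of np dot select, unrelated to the fish data:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
conditions = [x < 2, x < 4]
choices = ["small", "medium"]

# The first matching condition wins; default covers unmatched elements
result = np.select(conditions, choices, default="large")
```

Element by element: 1 matches the first condition, 2 and 3 match only the second, and 4 matches neither, so it falls through to the default.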
9. Choosing an intercept with np.select()
The conditions list contains a condition statement for each species. Each condition evaluates to True or False depending on whether the species is Bream, Perch, Pike, or Roach.
The choices list is the collection of intercepts that were extracted from the model coefficients. Recall that both lists have to contain the same number of elements.
np dot select will then retrieve the corresponding intercept for each of the fish species. Since our explanatory dataset contained 48 rows of data (12 for each fish species), the output will contain 48 intercepts as well. Notice the recurring pattern in the intercepts, corresponding to the repeating fish species.
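A sketch of this step. The intercept values here are made up for illustration; in practice you would read them off the model's coefficients:

```python
import numpy as np
import pandas as pd

# 48 rows: the four species repeating twelve times
explanatory_data = pd.DataFrame({
    "species": np.tile(["Bream", "Perch", "Pike", "Roach"], 12)
})

# One condition per species
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach",
]
# Hypothetical per-species intercepts (same length as conditions)
choices = [-672.2, -713.3, -1089.5, -726.8]

# One intercept per row, following the repeating species pattern
intercept = np.select(conditions, choices)
```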
10. The final prediction step
The final step is to calculate the response. As before, the response is the intercept plus the slope times the numeric explanatory variable. This time, the intercept is different for different rows.
The model predicts some negative masses, which isn't a good sign. Let's check that we got the right answer by calling predict.
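A minimal sketch of the final step, with an assumed slope and per-row intercepts as produced by the np dot select step:

```python
import numpy as np
import pandas as pd

# Assumed slope and per-species intercepts for illustration
slope = 42.6
explanatory_data = pd.DataFrame({
    "length_cm": [5, 5, 60, 60],
    "species": ["Bream", "Pike", "Bream", "Pike"],
})
intercept = np.array([-672.2, -1089.5, -672.2, -1089.5])

# Response = per-row intercept + shared slope * length
prediction_data = explanatory_data.assign(
    mass_g=intercept + slope * explanatory_data["length_cm"]
)
```

With these numbers, the 5cm rows come out negative, illustrating why the model is unreliable for small lengths.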
11. Compare to .predict()
You can see that the predictions are the same numbers as the mass column that we calculated, so our calculations are correct. It's just that this model performs poorly for small fish lengths.
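The check described above can be sketched end to end on a tiny synthetic dataset (the data and column names are stand-ins; the coefficient names follow statsmodels' formula-API conventions):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Tiny synthetic stand-in for the fish data
fish = pd.DataFrame({
    "length_cm": [10, 20, 30, 10, 20, 30],
    "species": ["Bream", "Bream", "Bream", "Pike", "Pike", "Pike"],
    "mass_g": [100.0, 350.0, 600.0, 50.0, 500.0, 950.0],
})
mdl = ols("mass_g ~ length_cm + species", data=fish).fit()
coeffs = mdl.params

# Manual calculation: per-row intercept plus the shared slope
intercept = np.select(
    [fish["species"] == "Bream", fish["species"] == "Pike"],
    [coeffs["Intercept"], coeffs["Intercept"] + coeffs["species[T.Pike]"]],
)
manual = intercept + coeffs["length_cm"] * fish["length_cm"]

# The manual numbers agree with .predict()
assert np.allclose(manual, mdl.predict(fish))
```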
12. Let's practice!
Time for you to make predictions.