1. Baseline model
After going through the initial steps in the competition and feature engineering, it's time to train some machine learning models.
2. Modeling stage
Recall the modeling stage we introduced in the previous chapter. We've already covered some data preprocessing techniques, like missing data imputation and categorical encoding, as well as creating new features. In this chapter, we will talk about model creation and some additional tricks to apply.
3. Modeling stage
To start this loop, we should establish the baseline model. It's usually a very simple model that allows us to check the whole pipeline we've written, review the local validation process, and generate the first submissions for the test data.
4. New York City taxi validation
Let’s work again with the New York City Taxi competition data. The task is to predict the fare amount for a taxi ride in New York City. The competition metric is root mean squared error.
For the sake of simplicity, we will use a 30% holdout sample as the local validation set. So, we create a simple holdout split using the train_test_split() function from scikit-learn.
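A minimal sketch of the holdout split just described. The small synthetic DataFrame stands in for the competition's train file, and the variable names, the passenger_count column name, and the random_state value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition's train data (assumption: the real
# file would be loaded with pd.read_csv from a path the text does not give).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "id": np.arange(1000),
    "fare_amount": rng.gamma(shape=2.0, scale=5.0, size=1000),
    "passenger_count": rng.integers(1, 7, size=1000),
})

# Keep 30% of the rows aside as the local validation (holdout) set.
validation_train, validation_test = train_test_split(
    train, test_size=0.3, random_state=123
)

print(len(validation_train), len(validation_test))
```

Fixing random_state keeps the split identical between runs, so scores from different models stay comparable.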
5. Baseline model I
The simplest model is to assign the average fare value to all the test observations.
For this purpose, we take the mean of the 'fare_amount' column over the whole train set and just assign this number to all the observations in the test set.
Then, we select the id and fare_amount columns and write the predictions to the submission file.
This approach gives an RMSE of about 10 dollars on both local validation and the Public Leaderboard, which places it 1449th out of 1500 participants.
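The mean baseline can be sketched as follows. The data here is synthetic and only stands in for the competition files; the variable names and the file name mean_sub.csv are assumptions, while the id and fare_amount columns follow the narration. The RMSE printed for the synthetic data says nothing about the real competition scores.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the competition files (assumption, not real rides).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "id": np.arange(1000),
    "fare_amount": rng.gamma(shape=2.0, scale=5.0, size=1000),
})
test = pd.DataFrame({"id": np.arange(1000, 1200)})

# Local validation: compute the mean on the training part of the holdout
# split only, then score it on the 30% that was held out.
validation_train, validation_test = train_test_split(
    train, test_size=0.3, random_state=123
)
holdout_pred = np.full(len(validation_test), validation_train["fare_amount"].mean())
local_rmse = np.sqrt(mean_squared_error(validation_test["fare_amount"], holdout_pred))
print(f"Local validation RMSE: {local_rmse:.3f}")

# Submission: take the mean over the whole train set and assign it to
# every test observation.
naive_prediction = train["fare_amount"].mean()
test["fare_amount"] = naive_prediction

# Write only the id and fare_amount columns (file name is hypothetical).
test[["id", "fare_amount"]].to_csv("mean_sub.csv", index=False)
```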
6. Baseline model II
We could make the model a bit more complex by taking the mean grouped by the number of passengers. The idea is the same: assign the average fare amount of a group to every observation in that group. First, we compute the group means on the train data.
Then we make predictions on the test set using pandas' map() method, which maps each passenger count to the corresponding average fare amount.
Then, again, write predictions to the file.
This model achieves slightly better results, moving up 30 places on the Public Leaderboard.
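A sketch of the grouped baseline under the same assumptions as before: synthetic data, a passenger_count column name that the narration does not spell out, and a hypothetical output file name. The fillna step is an added precaution that the narration does not mention.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the competition files (assumption, not real rides).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "id": np.arange(1000),
    "passenger_count": rng.integers(1, 7, size=1000),
    "fare_amount": rng.gamma(shape=2.0, scale=5.0, size=1000),
})
test = pd.DataFrame({
    "id": np.arange(1000, 1200),
    "passenger_count": rng.integers(1, 7, size=200),
})

# Average fare for each passenger count, computed on the train data.
group_means = train.groupby("passenger_count")["fare_amount"].mean()

# Map every test row's passenger count to its group's average fare.
test["fare_amount"] = test["passenger_count"].map(group_means)

# Precaution not in the narration: a passenger count that never occurs in
# train maps to NaN, so fall back to the overall train mean.
test["fare_amount"] = test["fare_amount"].fillna(train["fare_amount"].mean())

# Write the submission (file name is hypothetical).
test[["id", "fare_amount"]].to_csv("mean_group_sub.csv", index=False)
```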
7. Baseline model III
Finally, we could apply an out-of-the-box scikit-learn Gradient Boosting model to all the numeric features available. We restrict ourselves to numeric features so that we can skip preprocessing and keep the baseline model simple.
Features include latitudes and longitudes together with the number of passengers.
We then fit the GradientBoostingRegressor on the train data and make predictions on the test data.
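The fit-and-predict step might look like the sketch below. The coordinate and passenger column names are assumed, the make_rides helper is hypothetical and only generates synthetic rides, and random_state is set purely for reproducibility; otherwise the regressor keeps its default settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def make_rides(n, rng, start_id=0):
    """Hypothetical helper: synthetic rides standing in for the taxi data."""
    return pd.DataFrame({
        "id": np.arange(start_id, start_id + n),
        "pickup_longitude": rng.uniform(-74.05, -73.75, n),
        "pickup_latitude": rng.uniform(40.60, 40.90, n),
        "dropoff_longitude": rng.uniform(-74.05, -73.75, n),
        "dropoff_latitude": rng.uniform(40.60, 40.90, n),
        "passenger_count": rng.integers(1, 7, n),
    })


rng = np.random.default_rng(0)
train = make_rides(500, rng)
train["fare_amount"] = rng.gamma(shape=2.0, scale=5.0, size=len(train))
test = make_rides(100, rng, start_id=500)

# Numeric features only: coordinates plus the number of passengers.
features = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "passenger_count",
]

# Default hyperparameters; random_state only makes the run repeatable.
gb = GradientBoostingRegressor(random_state=123)
gb.fit(train[features], train["fare_amount"])

# Predict the fare for every test ride.
test["fare_amount"] = gb.predict(test[features])
```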
8. Baseline model III
We write predictions to the file and submit to Kaggle.
And here are the results. Wow! It is a huge jump: we advanced 300 positions on the Public Leaderboard, and the error dropped to about 5-6 dollars.
9. Intermediate results
Now we have three simple submissions with local and Public Leaderboard scores. Let's take a look at them.
One can easily see that the local score correlates with the Public Leaderboard score (the correlation is not perfect, but the better the local score, the better the Public Leaderboard score). It is a good sign, and we can proceed with our naive validation strategy.
10. Correlation with Public Leaderboard
Generally, the ideal situation is to observe this kind of correlation between local validation and Public Leaderboard scores. The values do not need to be exactly the same, but if the local score is improving, then we want to see improvements on the Leaderboard as well.
Let's compare the results of two different validation strategies. Results of the first validation strategy are presented in the table on the left. We see some improvements in the validation error with different models, but no improvements on the Public Leaderboard. It's a sign that something could be wrong with our models or validation scheme.
Now, look at another validation strategy on the right. With the improvement in the validation error, the Public Leaderboard error is also improving. So, this strategy is more reliable.
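One way to make this check mechanical is sketched below. The function name is hypothetical, and the scores in the usage lines are invented purely to mimic the two situations described; they are not results from the competition.

```python
def scores_move_together(local_scores, public_scores):
    """Return True if every drop in local error between consecutive
    submissions comes with a drop in Public Leaderboard error."""
    pairs = list(zip(local_scores, public_scores))
    for (prev_local, prev_public), (cur_local, cur_public) in zip(pairs, pairs[1:]):
        local_improved = cur_local < prev_local
        public_improved = cur_public < prev_public
        if local_improved and not public_improved:
            return False
    return True


# Invented RMSE values for three consecutive submissions (illustration only).
# First strategy: local error falls, Public Leaderboard error does not.
print(scores_move_together([9.5, 9.0, 8.6], [9.8, 9.8, 9.9]))

# Second strategy: both errors fall together.
print(scores_move_together([9.5, 9.0, 8.6], [9.8, 9.3, 8.9]))
```

A validation strategy for which this check fails deserves a second look before its local scores are trusted for model selection.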
11. Let's practice!
All right, it's time to create a couple of baseline models of your own!