
Prepare your first submission

1. Prepare your first submission

We already know how to explore the train data, and how to determine what our submission file should look like. Let’s learn how to prepare our first Kaggle submission.

2. What is a submission

Recall the graph of the competition process. In this lesson, we will talk about the green block, which consists of building the model and preparing the submission file. A submission is usually a .csv file that contains our test predictions and is uploaded to the Kaggle platform. Kaggle internally measures the quality of the predictions and shows the results on the Leaderboard. We will talk about the Leaderboard in the next lesson.

3. New York City taxi fare prediction

We will again work with the taxi fare prediction problem. Recall that the train data contains the target variable 'fare_amount' and some other features, such as pickup and dropoff positions, pickup datetime, and the number of passengers. By features, we mean all the variables that are used to predict the target variable.

4. Problem type

Before creating any Machine Learning model, we should determine the type of problem we're addressing: classification, regression, or something else. For this purpose, let's plot the distribution of the 'fare_amount' column as a histogram using the pandas hist() method. It's clear from the image that 'fare_amount' is a continuous variable, so we're dealing with a regression problem.
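As a rough sketch of the histogram step, the snippet below calls hist() on a tiny made-up 'fare_amount' column. The values are invented for illustration and are not the competition data; the non-interactive "Agg" backend is an assumption so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so use a non-interactive backend

import pandas as pd

# Toy stand-in for the train data; these fares are invented, not real competition values.
train = pd.DataFrame(
    {"fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3, 12.1, 7.5, 16.5, 9.0, 8.9]}
)

# Series.hist() draws the histogram and returns the matplotlib Axes it drew on.
ax = train["fare_amount"].hist(bins=5)
ax.set_xlabel("fare_amount")
ax.set_ylabel("count")
```

Many distinct floating-point values spread over a range, as here, suggest a continuous target and therefore a regression problem.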

5. Build a model

What is the simplest method that comes to mind when we hear about a regression problem? Linear regression, of course. Let's take a couple of features available in the train set and build a simple linear regression model using scikit-learn: first create a LinearRegression object, then fit it on the train data. Note that to select multiple columns from a pandas DataFrame we need double brackets, that is, a list of column names inside the indexing brackets.
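A minimal sketch of the model-building step might look as follows. The feature column names (pickup and dropoff coordinates) and all values are assumptions made for illustration; the transcript does not list the exact columns used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy train data; column names other than 'fare_amount' and all values are assumed.
train = pd.DataFrame({
    "pickup_longitude": [-73.84, -74.02, -73.98, -73.99, -73.97, -74.00],
    "pickup_latitude": [40.72, 40.71, 40.76, 40.73, 40.77, 40.73],
    "dropoff_longitude": [-73.84, -73.98, -73.99, -73.99, -73.96, -73.97],
    "dropoff_latitude": [40.71, 40.78, 40.75, 40.76, 40.78, 40.76],
    "fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3, 12.1],
})

features = ["pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude"]

# Create the model object, then fit it on the selected features and the target.
lr = LinearRegression()
lr.fit(X=train[features], y=train["fare_amount"])

# train[features] is the same as the double-bracket form
# train[["pickup_longitude", ...]] and returns a DataFrame, not a Series.
```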

6. Predict on test set

After we've trained a model, the next step is to make predictions on the test set. Select the same set of columns the model was trained on. Then take the fitted LinearRegression object from the previous slide and predict the fare amount. The predictions are stored in a new 'fare_amount' column of the test DataFrame.
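The prediction step could be sketched like this. To keep the example self-contained, the model is refit on toy data first; the two feature names and every value are invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed feature names; the real competition data may use different columns.
features = ["pickup_longitude", "pickup_latitude"]

# Toy train and test sets with invented values.
train = pd.DataFrame({
    "pickup_longitude": [-73.84, -74.02, -73.98, -73.99, -73.97],
    "pickup_latitude": [40.72, 40.71, 40.76, 40.73, 40.77],
    "fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3],
})
test = pd.DataFrame({
    "key": ["id_0", "id_1", "id_2"],  # hypothetical row identifiers
    "pickup_longitude": [-73.97, -73.99, -73.98],
    "pickup_latitude": [40.76, 40.72, 40.75],
})

lr = LinearRegression()
lr.fit(train[features], train["fare_amount"])

# Predict using exactly the columns the model was trained on,
# and store the result in a new 'fare_amount' column of the test DataFrame.
test["fare_amount"] = lr.predict(test[features])
```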

7. Prepare submission

Having made test predictions, we prepare a submission file. Usually, Kaggle submission files are in .csv format with 2 columns: an ID and the predicted target variable. As we already know, the format of the output file is specified in the sample submission. Let's look at the one from the taxi fare competition. It consists of the columns 'key' and 'fare_amount'. So, select these columns from the test DataFrame and save them to a .csv file. To write a DataFrame to a .csv file, use the pandas to_csv() method. The resulting file is ready to be submitted on Kaggle! In the next lesson, we'll learn more about uploading the submission to Kaggle.
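Writing the submission file can be sketched as below. The keys, the predictions, the extra 'passenger_count' column and the file name first_sub.csv are all made up for illustration. The file goes to the system temp directory only so the example runs anywhere; in practice you would write it wherever is convenient.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy test DataFrame that already holds predictions; all values are invented.
test = pd.DataFrame({
    "key": ["id_0", "id_1", "id_2"],  # hypothetical row identifiers
    "passenger_count": [1, 2, 1],
    "fare_amount": [9.8, 11.2, 7.4],
})

# Hypothetical output location and file name.
submission_path = Path(tempfile.gettempdir()) / "first_sub.csv"

# Keep only the two columns named in the sample submission.
# index=False stops pandas from writing the row index as an extra column.
submission = test[["key", "fare_amount"]]
submission.to_csv(submission_path, index=False)

# Read the file back to confirm it has the expected layout.
check = pd.read_csv(submission_path)
```

Passing index=False matters here: without it, the file would contain a third, unnamed index column that the sample submission does not have.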

8. Let's practice!

But first, let's practice! You've seen how to develop a simple model and prepare the results for the submission. Now, it's your turn to prepare a submission file for the Demand Forecasting Challenge.