Understand the problem

1. Understand the problem

In the previous chapter, we got acquainted with what a Machine Learning competition actually looks like, and had an overview of the general competition process. Now it's time to start solving the problems!

2. Solution workflow

Before proceeding, let's take a look at the broad scheme that we'll be using throughout the subsequent chapters. Let's call it a 'solution workflow'. Typically it consists of four major stages. First, we start by understanding the problem and the competition metric.

3. Solution workflow

Then we need to perform some EDA (exploratory data analysis) in order to see and understand the data we're working with.

4. Solution workflow

The next very important step is to establish the local validation strategy. We already know that its goal is to prevent overfitting.

5. Solution workflow

Finally, the longest part of the competition is Modeling, which includes continuous improvements of the solution. In this chapter, we will talk about the first three blocks. The third and fourth chapters are entirely devoted to Modeling.

6. Understand the problem

To understand the problem we need to perform the following steps. Determine the data type we will be dealing with. Is it the usual tabular data?

7. Understand the problem

Or maybe we're given time series data.

8. Understand the problem

Or it's unstructured data like images.

9. Understand the problem

Or text, and so on. It could even be a mix of multiple data types. In this course, we mostly concentrate on tabular data and time series. No worries, the general solution workflow is the same for any data type. The next step is to determine the problem type. We've talked about it a little in the previous chapter. Here we should select between classification, regression, ranking and so on. Lastly, we should get familiar with the metric being optimized. As we already know, every competition has a single metric. It is used by Kaggle to evaluate the submissions and to determine the best performing solution.
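As a rough illustration of the first two steps, here is a minimal sketch that inspects column types and makes a first guess at the problem type from the target column. The toy DataFrame, its column names, the `guess_problem_type` helper and the `max_classes` threshold are all hypothetical and only serve the example. In a real competition, the problem type comes from the competition description, not from a heuristic.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical toy training set; the column names are illustrative only.
train = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "store": ["A", "B", "A", "B"],
    "price": [9.5, 12.0, 9.0, 11.5],
    "sold": [0, 1, 1, 0],
})

# Column dtypes give a first hint about the data type:
# a datetime column may indicate time series, the rest is plain tabular data.
print(train.dtypes)


def guess_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Rough heuristic (an assumption, not a rule) for a first guess."""
    # Boolean or non-numeric targets are treated as class labels.
    if ptypes.is_bool_dtype(target) or not ptypes.is_numeric_dtype(target):
        return "classification"
    # Integer targets with few distinct values are treated as class labels.
    if ptypes.is_integer_dtype(target) and target.nunique() <= max_classes:
        return "classification"
    # Everything else is treated as a continuous target.
    return "regression"


print(guess_problem_type(train["sold"]))
print(guess_problem_type(train["price"]))
```

The heuristic can easily be wrong, for example for integer counts or ranking targets, so it should only be used as a starting point for reading the competition rules.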

10. Metric definition

Generally, the majority of the metrics can be found in the sklearn.metrics module. However, there are some special competition metrics that are not available in scikit-learn. In such cases, we have to create metrics manually. Suppose we're solving a competition problem with Root Mean Squared Logarithmic Error as the evaluation metric. Older scikit-learn versions do not implement this metric directly (newer releases provide root_mean_squared_log_error), so let's treat it as an example of a metric we define ourselves. Its formula is presented on the slide. N is the number of observations in the test set, y is the actual value, y hat is the predicted value. So, it is the usual Root Mean Squared Error on a logarithmic scale. In this situation, we have to define a custom function that takes the true and predicted values as input, and outputs the metric value. First, we compute the squared differences under the sum using the numpy log and power functions. Finally, we take the square root of the mean over all the observations, and return the result.
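The custom function described above could look roughly like the following sketch. The slide formula is not reproduced in this transcript, so the code assumes the commonly used definition of RMSLE, which adds 1 to both the actual and predicted values before taking the logarithm: the square root of the mean of (log(y + 1) - log(y hat + 1)) squared. The function name `rmsle` and its argument names are chosen for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error.

    Assumes the common definition with a +1 offset inside the logarithm,
    so both inputs must be greater than -1.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared differences of the log-transformed values (the terms under the sum).
    squared_log_diffs = np.power(np.log(y_true + 1) - np.log(y_pred + 1), 2)
    # Square root of the mean over all observations.
    return np.sqrt(np.mean(squared_log_diffs))


# Identical predictions give an error of zero.
print(rmsle([1, 2, 3], [1, 2, 3]))
```

Because the error is computed on a logarithmic scale, it penalizes relative rather than absolute differences, so being off by 10 matters much more for a true value of 10 than for a true value of 1000.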

11. Let's practice!

The main takeaway from this lesson is that before building any models, we should perform some preliminary steps to understand the data and the problem we're facing. So, let's practice with other problem types and metrics!