
Prepare data for machine learning

1. Prepare data for machine learning

Now that you have accessed, processed, and analyzed your data, it's ready for machine learning! In this chapter, we will first split our data into train and test datasets. We will then develop a simple machine learning pipeline, train it on the data we've seen throughout this course, and apply the model to a data stream for real-time predictions.

2. Machine Learning Refresher

Machine learning is a very broad field, which could easily fill multiple courses on its own. It includes supervised learning, which is itself separated into classification and regression algorithms; unsupervised learning, or cluster analysis; and deep learning, or neural networks.

3. Machine Learning Refresher

In this course, we will only cover classification, a supervised learning technique.

4. Labels

The environmental data we've been working with so far has been labeled for us. The data has been gathered via an application where users could label how the weather felt: either good weather, represented as 1, or bad weather, represented as 0.

5. Train / Test split

In machine learning, it's common to split data into multiple subsets, normally a train and a test set. We do this to ensure that the model can predict new, unseen data, and therefore ensure it does not overfit. The model should not see the test data during training; the test data is only used to validate the model. A common split ratio is 80:20, with 80% of the data available for training and 20% available for testing. For time series data, we cannot randomly split the data into train and test sets. We must prevent the model from looking into the future, as future data will not be available when applying the model to real data. For this reason, we split the data at a cut-off point, before which all data goes into the train set and after which all data goes into the test set.

6. Train / test split

To manually define the split date, we first assign this date to the variable split_day. We choose a date at around 80% of the total data, so the split is roughly 80:20. We then use time-series slicing to put all data up to the cut-off date into the train set and all data after it into the test set. We can inspect the result by using iloc[] to get the first and last element and printing the name attribute, which is the index of the row. We can see that the train data ranges from October 1st to the last entry on October 13th, and the test data from October 14th until October 15th, as expected from the cut-off we chose.
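The split described above could be sketched like this. Note that the column names, date range, and values here are illustrative stand-ins for the course's environmental data, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly environmental data with a datetime index
# (column names and values are assumptions for illustration)
idx = pd.date_range("2023-10-01", "2023-10-15 23:00", freq="h")
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "temperature": rng.normal(15, 5, len(idx)),
        "humidity": rng.uniform(0.3, 0.9, len(idx)),
        "pressure": rng.normal(1012, 4, len(idx)),
        "target": rng.integers(0, 2, len(idx)),
    },
    index=idx,
)

# Cut-off date at roughly 80% of the data
split_day = "2023-10-13"

# Time-series slicing: everything up to and including the cut-off
# goes into train, everything after into test
train = data.loc[:split_day]
test = data.loc["2023-10-14":]

# Inspect the boundaries via the row index (the .name of a row)
print(train.iloc[0].name, train.iloc[-1].name)
print(test.iloc[0].name, test.iloc[-1].name)
```

Because the index is sorted by time, this guarantees that no test observation predates any training observation.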

7. Features and Labels

To train the model, we also need to separate the labels from the features. We do this by dropping the target variable from the combined DataFrame and assigning the result to X, our features, and by selecting the target column and assigning that to y, our labels. We implement this step for both the train and test datasets. If we look at the shape of the new DataFrames, we see that X_train has 3 columns, or features, and 1248 observations, while y_train has the same number of observations but only one column.

8. Logistic Regression

Let's now build our first machine learning classification model. We start by importing LogisticRegression from sklearn.linear_model. Then LogisticRegression() is initialized and stored as logreg. We then fit the model to the data, where X_train is our training dataset and y_train contains the labels for the training dataset. We use .predict() to classify the test data and print the resulting classes to screen. The predicted classes are two consecutive bad-weather classifications, followed by five good-weather predictions, and then two more bad-weather classifications.
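The fit-and-predict pattern above can be sketched as follows. Since the course's actual data isn't available here, synthetic arrays stand in for X_train, y_train, and X_test (their shapes are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the weather features and 0/1 labels
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] + rng.normal(size=100) > 0).astype(int)
X_test = rng.normal(size=(9, 3))

# Initialize and fit the classifier on the training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Classify the test data; each prediction is a 0 (bad weather)
# or 1 (good weather) class label
preds = logreg.predict(X_test)
print(preds)
```

With real data, the printed array would correspond to the sequence of bad- and good-weather classifications described above.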

9. Let's practice!

And now it's your turn.
