Create train and test features
Before we fit our linear model, we want to add a constant to our features, so we have an intercept for our linear model.
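To see why the constant matters, here is a minimal sketch using plain NumPy rather than statsmodels. The toy data and variable names are hypothetical and are not part of the exercise. A column of ones is prepended by hand, which mirrors what statsmodels' add_constant() does by default.

```python
import numpy as np

# Hypothetical toy data: y = 2 + 3*x exactly, so the true intercept is 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Without a constant column, the least-squares line is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Prepending a column of ones lets the fit estimate an intercept as well.
X_const = np.column_stack([np.ones(len(x)), x])
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)
intercept, slope = coefs

print("no-intercept slope:", slope_only[0])
print("intercept and slope with constant:", intercept, slope)
```

With the constant column, the fit recovers both the intercept and the slope of the toy data. Without it, the slope is distorted to compensate for the missing intercept.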
We also want to split our features and targets into train and test sets. This is so we can fit our model to the train dataset and evaluate its performance on the test dataset. We always want to check performance on data the model has not seen to make sure we're not overfitting, that is, memorizing patterns in the training data so exactly that the model fails to generalize to new data.
With a time series like this, we typically want to use the oldest data as our training set, and the newest data as our test set. This is so we can evaluate the performance of the model on the most recent data, which will more realistically simulate predictions on data we haven't seen yet.
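The chronological split described above can be sketched as follows. This is an illustration only: the arrays are hypothetical stand-ins for the exercise's features and targets, and the rows are assumed to be ordered from oldest to newest.

```python
import numpy as np

# Hypothetical stand-ins for the exercise's features and targets:
# 20 rows ordered from oldest (row 0) to newest (row 19).
n_rows = 20
features = np.arange(n_rows * 2, dtype=float).reshape(n_rows, 2)
targets = np.arange(n_rows, dtype=float)

# Use the first 85% of rows for training; int() truncates to a whole row count.
train_size = int(0.85 * features.shape[0])

# No shuffling: the oldest rows go to the train set, the newest to the test set.
train_features, test_features = features[:train_size], features[train_size:]
train_targets, test_targets = targets[:train_size], targets[train_size:]

print(train_features.shape, test_features.shape)
```

Because the rows are never shuffled, every test row is more recent than every train row, so the test score reflects how the model would behave on data that arrives after training.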
This exercise is part of the course
Machine Learning for Finance in Python
Exercise instructions
- Import the statsmodels.api library with the alias sm.
- Add a constant to the features variable using statsmodels' .add_constant() function.
- Set train_size as 85% of the total number of datapoints (number of rows) using the .shape[0] property of features or targets.
- Break up linear_features and targets into train and test sets using train_size and Python indexing (e.g. [start:stop]).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the statsmodels.api library with the alias sm
___
# Add a constant to the features
linear_features = sm.____(features)
# Create a size for the training set that is 85% of the total number of samples
train_size = int(0.85 * ____)
# Split features and targets into train (oldest rows) and test (newest rows) sets
train_features = linear_features[:train_size]
train_targets = targets[____]
test_features = linear_features[train_size:]
test_targets = targets[train_size:]
print(linear_features.shape, train_features.shape, test_features.shape)