Linear modeling

1. Linear modeling with financial data

Now that we have features and targets, we can fit our first machine learning model -- a linear model.

2. Make train and test sets

For machine learning, we usually split our data into train and test sets. With time series data, we want to break the data into contiguous chunks: the training data should be the earliest data, and the test data should be the latest data. We fit our model to the training data and test on the newest data to understand how our algorithm will perform on new, unseen data. We can't use sklearn's train_test_split with its default settings, because it randomly shuffles the data before splitting.

3. Make train and test sets

For linear models, we need to add a constant to our features, which adds a column of ones for a y-intercept term. statsmodels has add_constant() for this. Then we split the data into train and test sets. First, we get the index we'll split at by using the train set fraction and the number of rows in our data. We get the number of rows from the dot-shape property and convert this to an integer. Finally, we split features and targets into train and test sets using Python's indexing. Remember Python indexing goes [start:stop:step]. Here, we start from the beginning and go to train_size for the training dataset, then go from train_size to the end of the data for the test set.

4. Linear modeling

Now that we have our train and test sets, we can fit a linear model. We first create the model with the OLS() function from statsmodels, passing it our train_targets first and then our train_features -- statsmodels expects the targets before the features. Then we use the fit method, which returns an object containing the results of the fit.

5. Linear modeling

Printing out the summary of the fit results will yield a lot of information.

6. Linear modeling

We see the R-squared value in the upper right and many other metrics. We can compare this value with the R-squared from other models we try. An R-squared of 1 means a perfect fit, and the lower the R-squared value, the worse our fit. We also see the coefficients for each feature under the coef column near the bottom left. Each coefficient is the amount the target changes for a one-unit change in that feature. A positive value means the target increases as that feature increases, and vice versa for a negative coefficient.

7. p-values

Linear models are one of the simplest machine learning models and are easy to interpret. For example, we can use the p-values, shown in the P>|t| column, to understand which variables are meaningfully correlated with the target. We get these values with results-dot-pvalues. p-values are the result of a t-test on the coefficients. This is a statistical test checking whether the coefficients are significantly different from 0. The p-value is the probability of seeing a coefficient at least this far from 0 if the true coefficient were actually 0. Typically we say a p-value of less than 0-point-05 means our coefficient is significantly different from 0, and in this case, it looks like a few of our coefficients are significant.

8. Plotting the results

Lastly, we'll plot our results. A quick way to see how our model is doing is to plot the predictions versus the actual values. Perfect predictions yield a straight line, which is what happens when we plot the true targets versus themselves.

9. Plotting the results

However, linear models are pretty weak, so we will probably get more of a point cloud like this. When we start using more advanced models like gradient boosting and neural networks, our predictions should improve.

10. Time to fit a linear model!

Now let's create our first machine learning model with our data!