1. Engineering features
Our linear model was OK, but it's limited. What if interactions between features drive price changes?
2. Feature interactions
Feature interaction occurs when the combined effect of features differs from their individual effects. For example, if we multiply the 14-day moving average and RSI together, the resulting feature correlates with the target more strongly than either feature alone, as seen in the far-right column. This new feature correlates with the target at 0.07, versus -0.01 and 0.06 for the individual features.
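As a rough sketch of how that interaction could be built (assuming a DataFrame amd_df with an adjusted close column and a 5-day future return target; the column names here are illustrative, not from the course data):

    import talib

    # 14-day moving average and RSI from the adjusted close
    amd_df['ma14'] = talib.SMA(amd_df['Adj_Close'].values, timeperiod=14)
    amd_df['rsi14'] = talib.RSI(amd_df['Adj_Close'].values, timeperiod=14)

    # Interaction feature: the product of the two indicators
    amd_df['ma14_x_rsi14'] = amd_df['ma14'] * amd_df['rsi14']

    # Compare each feature's correlation with the (assumed) target column
    print(amd_df[['ma14', 'rsi14', 'ma14_x_rsi14',
                  '5d_future_return']].corr()['5d_future_return'])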
3. One problem with linear models
One problem with linear models is that we must explicitly add feature interactions. The other models we're about to use can learn non-linear relationships between the features and targets on their own. These include random forests, gradient boosting, and neural networks.
4. Feature engineering
These more complex models allow us to engineer more features, which often improves predictive performance. We will use the volume data as one new feature. Volume is essentially uncorrelated with price movements, but combinations of volume and other features may add some predictive power.
5. Volume
We're going to engineer some new features using the volume from our stocks. Volume is simply the number of shares traded in a given time period; in our case, this is shares traded per day. The plot shows the AMD price on top and the daily volume in the bar plot on the bottom.
6. Volume features
To engineer features from the volume data, we will calculate percent changes. We first get the percent change from one day to the next with the pct_change method from pandas, then use talib's SMA function to calculate the moving average of the percent volume changes.
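A minimal sketch of those two steps (assuming the volume lives in a 'Volume' column; the 5-day smoothing window is an arbitrary choice here):

    import talib

    # Day-over-day percent change in volume
    amd_df['volume_1d_change'] = amd_df['Volume'].pct_change()

    # Smooth the noisy percent changes with a moving average
    amd_df['volume_1d_change_SMA'] = talib.SMA(
        amd_df['volume_1d_change'].values, timeperiod=5)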
7. Datetime feature engineering
Another opportunity for feature engineering comes from the datetimes in our DataFrames. Almost all financial data has a datetime associated with it, which allows for lots of feature engineering. Here, we are only going to use the day of the week, but you could also add features for the month, the year, the quarter, or the number of days elapsed since some event -- for example, an earnings release. For hourly or more fine-grained data, you could add features for the hour, minute, second, and so on.
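For instance, if the DataFrame is indexed by datetimes, pandas exposes these calendar attributes directly (a sketch; only the day of week is used in what follows):

    # Calendar features from a DatetimeIndex
    amd_df['month'] = amd_df.index.month      # 1 through 12
    amd_df['quarter'] = amd_df.index.quarter  # 1 through 4
    amd_df['year'] = amd_df.index.year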
8. Extracting the day of week
To extract the day of week from our datetimes, we can use the .dayofweek property of the DataFrame's index. This yields a number from 0 to 6 for the day of the week (0 being Monday). We can't put these numbers straight into our algorithms, though. Doing so would bias the classifiers into assuming a linear relationship between the days of the week, which may not actually exist.
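In code, extracting the raw numbers looks like this (assuming amd_df has a DatetimeIndex):

    # 0 = Monday, 1 = Tuesday, ..., 6 = Sunday
    days_of_week_numbers = amd_df.index.dayofweek
    print(days_of_week_numbers[:5])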
9. Dummies
Instead, we will use pandas' get_dummies function to one-hot encode our day-of-week variable. This creates a new column for each day of the week and assigns it a 1 or 0, depending on whether it was that day. If it's a Monday, the Monday column will be 1 and all the other columns 0. That means Monday is implied whenever every other column is 0, so the Monday column is redundant and we can drop it; this is what the drop_first argument in the get_dummies function does. We also set the prefix argument to make the column names a bit more descriptive. When we print the new days_of_week DataFrame, weekday_1 on the left is Tuesday since we dropped Monday, and weekday_4 on the right is Friday.
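A sketch of that call, continuing with the amd_df DataFrame and its DatetimeIndex from before:

    import pandas as pd

    # One-hot encode the day of week; drop_first removes Monday's column
    days_of_week = pd.get_dummies(amd_df.index.dayofweek,
                                  prefix='weekday',
                                  drop_first=True)

    # get_dummies returns a fresh integer index, so restore the dates
    days_of_week.index = amd_df.index
    print(days_of_week.head())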
10. Weak correlations
Now that we have our day-of-week and volume features, we can look at how they correlate with the target variable. The correlations are extremely weak, as seen in the far-right column of the plot; but remember, the idea here is that these weakly correlated features may interact with other features to produce stronger predictions.
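One way to check those correlations, continuing with the illustrative column names from the earlier sketches (the target name '5d_future_return' is an assumption):

    import pandas as pd

    # Join the day-of-week dummies onto the main DataFrame
    amd_df = pd.concat([amd_df, days_of_week], axis=1)

    # Correlation of each new feature with the (assumed) target column
    new_features = ['volume_1d_change_SMA',
                    'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4']
    print(amd_df[new_features + ['5d_future_return']]
          .corr()['5d_future_return'])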
11. Engineer some features!
Now let's engineer some features!