Feature engineering from volume
We're going to use non-linear models to make more accurate predictions. Linear models can only exploit features that are linearly related to the target, whereas other machine learning models can combine features in non-linear ways. For example, what if the price goes up when the moving average of price is going up and the moving average of volume is going down? The only ways to capture such an interaction are to multiply the features together, or to use a machine learning algorithm that can handle non-linearity (e.g. random forests).
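As a rough illustration of why interactions matter, the sketch below builds a toy target that depends only on the product of two features. All names and data here are hypothetical and randomly generated; nothing is taken from the course data. Each feature alone shows almost no linear correlation with the target, while the multiplied feature does.

```python
import numpy as np

# Synthetic, made-up features standing in for changes in two moving averages.
rng = np.random.default_rng(0)
n = 5000
price_ma_change = rng.normal(size=n)
volume_ma_change = rng.normal(size=n)

# Toy target driven purely by the interaction of the two features, plus noise.
target = price_ma_change * -volume_ma_change + 0.1 * rng.normal(size=n)

# Hand-made interaction feature: the product of the two inputs.
interaction = price_ma_change * -volume_ma_change

corr_price = np.corrcoef(price_ma_change, target)[0, 1]
corr_volume = np.corrcoef(volume_ma_change, target)[0, 1]
corr_interaction = np.corrcoef(interaction, target)[0, 1]

print(f"price feature alone:  {corr_price:.3f}")
print(f"volume feature alone: {corr_volume:.3f}")
print(f"interaction feature:  {corr_interaction:.3f}")
```

A linear model given only the two raw features would have little to work with here. It needs the product as an explicit input, whereas a tree-based model can approximate the interaction from the raw features.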
To incorporate more information that may interact with other features, we can add in weakly correlated features. First we will add volume data, which we have in lng_df as the Adj_Volume column.
Before you begin, remember that for TA-Lib functions (such as SMA()), you need to provide NumPy arrays, not pandas objects. You can use the .values attribute of a pandas Series or DataFrame to return it as a NumPy array.
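A minimal sketch of that conversion, using a small made-up Series rather than the course data (the name volume and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a volume column.
volume = pd.Series([100.0, 110.0, 105.0, 120.0], name="Adj_Volume")

# .values returns the underlying data as a NumPy array.
volume_array = volume.values

print(type(volume).__name__)        # Series
print(type(volume_array).__name__)  # ndarray
print(volume_array.dtype)           # float64
```

TA-Lib functions generally expect float64 input, so an integer-typed column may need .astype(float) before conversion. The pandas documentation also offers .to_numpy() as an alternative to .values.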
This exercise is part of the course Machine Learning for Finance in Python.
Exercise instructions
- Create a 1-day percent change in volume (use pct_change() from pandas), and assign it to the Adj_Volume_1d_change column in lng_df.
- Create a 5-day moving average of the 1-day percent change in volume, and assign it to the Adj_Volume_1d_change_SMA column in lng_df.
- Plot histograms of these two new features we created using the new_features list.
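The two feature calculations can be sketched on toy data as follows. This is an illustration under stated assumptions, not the exercise solution: toy_df and its column names are hypothetical, and a plain pandas rolling mean stands in for talib.SMA() so the sketch runs without TA-Lib installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the exercise itself works on lng_df['Adj_Volume'].
toy_df = pd.DataFrame(
    {"Volume": [100.0, 110.0, 99.0, 99.0, 108.9, 119.79, 119.79]}
)

# 1-day percent change: (today / yesterday) - 1.
# The first row has no previous day, so it is NaN.
toy_df["Volume_1d_change"] = toy_df["Volume"].pct_change()

# 5-day simple moving average of the percent change, computed with a
# pandas rolling mean as a stand-in for talib.SMA(..., timeperiod=5).
toy_df["Volume_1d_change_SMA"] = (
    toy_df["Volume_1d_change"].rolling(window=5).mean()
)

print(toy_df.round(4))
```

The first five rows of the moving-average column are NaN: one from the percent change and four more while the 5-day window fills up.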
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create 2 new volume features, 1-day % change and 5-day SMA of the % change
new_features = ['Adj_Volume_1d_change', 'Adj_Volume_1d_change_SMA']
feature_names.extend(new_features)
lng_df[____] = lng_df['Adj_Volume'].____
lng_df[____] = talib.SMA(____[____].____,
                         timeperiod=____)
# Plot histogram of volume % change data
lng_df[____].plot(kind='hist', sharex=False, bins=50)
plt.show()