Explore the data with some EDA
First, let's explore the data. Any time we begin a machine learning (ML) project, we need to first do some exploratory data analysis (EDA) to familiarize ourselves with the data. This includes things like:
- raw data plots
- histograms
- and more…
I typically begin with raw data plots and histograms, which show how the data evolves over time and how its values are distributed. If a variable is approximately normally distributed, we can use parametric statistics, which assume a particular distribution.
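As a rough illustration of checking whether a series looks normal, the sketch below computes summary statistics on daily percent changes. The price series is synthetic (an assumption standing in for a real adjusted close column), and skewness and excess kurtosis near zero are only an informal hint of normality, not a formal test.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns and prices: a hypothetical stand-in for real stock data
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.01, size=1000)
prices = pd.Series(100 * np.cumprod(1 + returns))

# 1-day percent change; the first value is NaN because it has no previous day
pct = prices.pct_change().dropna()

# For a normal distribution, skewness and excess kurtosis are both 0
summary = {
    "mean": pct.mean(),
    "std": pct.std(),
    "skew": pct.skew(),
    "excess_kurtosis": pct.kurt(),
}
print(summary)
```

Real stock returns often show fatter tails than this synthetic series, so it is worth looking at the histogram as well as these numbers.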
There are two stocks loaded for you into pandas DataFrames: lng_df and spy_df (LNG and SPY). Take a look at them with .head(). We'll use the closing prices and eventually volume as inputs to ML algorithms.
Note: Each time we want to make a new plot, we'll either call plt.clf() to clear the current figure or create a new figure with f = plt.figure().
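The difference between the two calls can be sketched as follows. This is a minimal example with made-up data; the Agg backend and the counter variable names are assumptions chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption) so no window is needed
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean state

plt.plot([1, 2, 3])
plt.clf()  # clears the current figure but keeps it open for reuse
n_after_clf = len(plt.get_fignums())

f = plt.figure()  # opens a second, separate figure and makes it current
plt.plot([3, 2, 1])
n_after_figure = len(plt.get_fignums())

print(n_after_clf, n_after_figure)
```

Calling plt.clf() reuses one figure, while plt.figure() keeps earlier figures around until they are closed.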
This exercise is part of the course Machine Learning for Finance in Python.
Exercise instructions
- Print out the first 5 lines of the two DataFrames (lng_df and spy_df) and examine their contents.
- Use the pandas library to plot raw time series data for 'SPY' and 'LNG' with the adjusted close price ('Adj_Close'); set legend=True in .plot().
- Use plt.show() to show the raw time series plot (matplotlib.pyplot has been imported as plt).
- Use pandas and matplotlib to make a histogram of the adjusted close 1-day percent difference (use .pct_change()) for SPY and LNG.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
print(lng_df.head()) # examine the DataFrames
print(____) # examine the SPY DataFrame
# Plot the Adj_Close columns for SPY and LNG
spy_df[____].plot(label='SPY', legend=True)
lng_df[____].plot(label=____, ____, secondary_y=True)
____ # show the plot
plt.clf() # clear the plot space
# Histogram of the daily price change percent of Adj_Close for LNG
lng_df['Adj_Close'].____.plot.hist(bins=50)
plt.xlabel('adjusted close 1-day percent change')
plt.show()
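For reference, here is one way the blanks above might be filled in. It is a sketch, not the course's official solution: the two DataFrames are rebuilt from synthetic random-walk prices (an assumption, since the preloaded data is not available here), and the Agg backend is set only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for the preloaded lng_df and spy_df (hypothetical data)
rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", periods=250, freq="B")
spy_df = pd.DataFrame(
    {"Adj_Close": 200 * np.cumprod(1 + rng.normal(0, 0.01, 250))}, index=dates
)
lng_df = pd.DataFrame(
    {"Adj_Close": 40 * np.cumprod(1 + rng.normal(0, 0.03, 250))}, index=dates
)

print(lng_df.head())  # examine the DataFrames
print(spy_df.head())

# Plot the Adj_Close columns; LNG goes on a secondary y-axis because of its different scale
spy_df["Adj_Close"].plot(label="SPY", legend=True)
lng_df["Adj_Close"].plot(label="LNG", legend=True, secondary_y=True)
plt.show()  # show the plot
plt.clf()  # clear the plot space

# Histogram of the daily price change percent of Adj_Close for LNG
lng_pct = lng_df["Adj_Close"].pct_change()
lng_pct.plot.hist(bins=50)
plt.xlabel("adjusted close 1-day percent change")
plt.show()
```

The first value returned by .pct_change() is NaN because there is no prior day to compare against; the histogram skips it.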