1. Introduction to time series
In this video, we'll be focusing on a common use case for anomaly detection: identifying anomalies in time series data.
2. Google stocks dataset
Time series is a common type of dataset, defined as a series of datapoints with a temporal component (date or time). Identifying anomalies in time series is a key skill for working with this kind of data. Let's have a look at how this type of dataset works. Here's an example from Google stocks.
Every time series has a temporal component and corresponding values. In this case, the temporal component is called "Date" ranging from 2006 to 2018. The values of the time series are the open/close, low/high and volume attributes of Google stocks for each day.
3. DateTime datatype
The date column isn't much use when it is stored as strings (the object datatype), which is what the read_csv function produces by default.
For time series, pandas has a more flexible DateTime data type, to which we can convert the Date column using the to_datetime function.
As you can see, the new datatype is represented as datetime64.
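The slide code is not part of this transcript, so here is a minimal sketch of the conversion. The DataFrame name `google` and the sample values are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-in for the stocks data; the values are made up for illustration.
google = pd.DataFrame({
    "Date": ["2006-01-03", "2006-01-04", "2006-01-05"],
    "Close": [217.83, 222.84, 225.85],
})
print(google["Date"].dtype)  # a string/object dtype before conversion

# Convert the text column to a proper datetime column.
google["Date"] = pd.to_datetime(google["Date"])
print(google["Date"].dtype)  # a datetime64 dtype after conversion
```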
4. Extracting features
Now, using this column, we can extract useful information from the time series like day of the week, day of the month, month number, etc.
These attributes are accessible via the dot-dt accessor.
We create three columns to store the three new features.
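A sketch of what this step might look like. The new column names (`day_of_week`, `day_of_month`, `month`) are assumptions, since the transcript does not name them.

```python
import pandas as pd

google = pd.DataFrame({
    "Date": pd.to_datetime(["2012-03-01", "2012-03-02", "2012-03-05"]),
})

# The .dt accessor exposes datetime attributes of a datetime column.
google["day_of_week"] = google["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
google["day_of_month"] = google["Date"].dt.day
google["month"] = google["Date"].dt.month
print(google)
```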
5. DatetimeIndex
We can change the index to become a DatetimeIndex and update the DataFrame directly using the syntax shown.
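The transcript does not show the syntax itself. One common way to do this, assuming the slide uses `set_index` with `inplace=True`, is sketched below on made-up values.

```python
import pandas as pd

google = pd.DataFrame({
    "Date": pd.to_datetime(["2006-01-03", "2006-01-04", "2006-01-05"]),
    "Close": [217.83, 222.84, 225.85],  # made-up values
})

# inplace=True modifies the existing DataFrame instead of returning a new one.
google.set_index("Date", inplace=True)
print(type(google.index).__name__)
```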
6. Choosing periods
The DatetimeIndex enables us to quickly filter the data between two years,
7. Choosing periods
or from the beginning of March 2012 to the 4th of October of 2015.
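Both kinds of filter can be sketched with partial-string slicing on `.loc`. The data below is synthetic, and the years 2010 and 2012 are chosen only for illustration because the transcript does not say which years the slide uses.

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering the same span as the dataset.
dates = pd.date_range("2006-01-01", "2018-12-31", freq="D")
google = pd.DataFrame({"Close": np.linspace(200, 1000, len(dates))}, index=dates)

# Year-level slice: both endpoints are inclusive, so all of 2012 is kept.
between_years = google.loc["2010":"2012"]

# Mixed precision: start of March 2012 through 4 October 2015.
period = google.loc["2012-03":"2015-10-04"]
print(period.index.min(), period.index.max())
```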
8. Loading datasets with a DatetimeIndex
Let's see how to load a dataset with a DatetimeIndex instead of setting an index later. This involves using two new parameters: parse_dates and index_col of read_csv.
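As a sketch, here is how the two parameters work together. An in-memory CSV with made-up rows stands in for the real file, whose name is not given in the transcript.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the stocks file.
csv_text = """Date,Open,High,Low,Close,Volume
2006-01-03,211.47,218.05,209.32,217.83,13137450
2006-01-04,222.17,224.70,220.09,222.84,15292353
"""

# parse_dates converts the column to datetimes; index_col makes it the index.
google = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(google.index)
```

With a file on disk, the path would replace the `io.StringIO(...)` argument.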
9. Plotting time series
With a DatetimeIndex, we can call the dot-plot method on any column to get a line plot.
10. Plotting time series
To get a more granular look, we can first zoom into an interval using DateTime indexing and then plot.
This produces a line plot of the closing price of Google stocks from January to July 2010.
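A sketch of the zoom-then-plot step, using a synthetic random-walk price in place of the real closing prices. It assumes matplotlib is installed, since pandas uses it for plotting.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic business-day prices for 2010 (a random walk, not real data).
dates = pd.bdate_range("2010-01-01", "2010-12-31")
rng = np.random.default_rng(0)
google = pd.DataFrame(
    {"Close": 300 + rng.normal(0, 3, len(dates)).cumsum()}, index=dates
)

# Select January to July 2010 first, then plot only that interval.
ax = google.loc["2010-01":"2010-07", "Close"].plot(title="Closing price, Jan to Jul 2010")
ax.set_ylabel("Close")
# plt.show()  # uncomment when running as a script
```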
11. Plotting time series
To start outlier detection on this dataset, you can use both univariate and multivariate approaches. For example, let's use Median Absolute Deviation on the number of shares traded. First, let's visualize it. We can already see a lot of spikes in the plot, which hints at the presence of outliers.
12. MAD on time series
Now, let's fit MAD to the Volume column.
Using the labels attribute and pandas boolean indexing, we find 236 outliers. Now, let's try the multivariate approach.
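To make clear what a MAD-based detector computes, here is a hand-rolled sketch in NumPy and pandas rather than the library estimator used in the video (apparently PyOD's `MAD`, though the transcript does not name the library). The volumes are synthetic, and the cutoff of 3.5 is an assumption (a commonly used value), so the count will not match the 236 found on the real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily volumes with two injected spikes (not real data).
rng = np.random.default_rng(0)
volume = pd.Series(rng.normal(1_000_000, 50_000, size=200))
volume.iloc[[20, 150]] = [3_000_000, 2_500_000]

# Median absolute deviation: the median distance from the median.
median = volume.median()
abs_dev = (volume - median).abs()
mad = abs_dev.median()

# Modified z-score; 0.6745 rescales MAD to be comparable to a standard deviation.
scores = 0.6745 * abs_dev / mad
labels = (scores > 3.5).astype(int)  # 1 = outlier, 0 = inlier (assumed cutoff)

# Boolean indexing keeps only the rows labeled as outliers.
outliers = volume[labels == 1]
print(len(outliers))
```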
13. IForest on time series
First, we add three new features to the Google dataset as we did before. The only difference is that we use the dot-index accessor instead of dot-dt on a column name because the date component is in the index now.
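A sketch of the same feature extraction when the dates live in the index. The column names are assumptions, as before.

```python
import pandas as pd

dates = pd.to_datetime(["2012-03-01", "2012-03-02", "2012-03-05"])
google = pd.DataFrame({"Volume": [1, 2, 3]}, index=dates)  # placeholder values

# A DatetimeIndex exposes the datetime attributes directly, without .dt.
google["day_of_week"] = google.index.dayofweek  # Monday=0 ... Sunday=6
google["day_of_month"] = google.index.day
google["month"] = google.index.month
print(google)
```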
Adding the new features allows the outlier classifier to see how stock prices and traded volume vary day-to-day, week-to-week and month-to-month.
14. IForest on time series
Now, let's use IForest.
Since the dataset is relatively small, we only need 100 iTrees, which is the default in IForest. Next, we find the number of outliers using a probability threshold of 75%.
This time, IForest flags almost a quarter as many outliers as MAD did. It is likely that using the rest of the columns and the new features helped IForest to identify new patterns about the nature of inliers and outliers in the data.
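The workflow above can be sketched with scikit-learn's `IsolationForest` as a stand-in for the `IForest` estimator named in the video. The outlier probability is approximated here by min-max scaling the anomaly scores, which is an assumption about how the 75% threshold is applied. The data is synthetic, so the counts will differ from the video.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic prices and volumes with two injected anomalies (not real data).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2010-01-01", periods=300)
google = pd.DataFrame({
    "Close": rng.normal(300, 5, size=300),
    "Volume": rng.normal(1_000_000, 50_000, size=300),
}, index=idx)
google.iloc[50] = [500.0, 5_000_000.0]
google.iloc[200] = [100.0, 4_000_000.0]

# Calendar features taken from the DatetimeIndex.
google["day_of_week"] = google.index.dayofweek
google["day_of_month"] = google.index.day
google["month"] = google.index.month

# 100 trees is also the scikit-learn default; it is set explicitly for clarity.
forest = IsolationForest(n_estimators=100, random_state=0).fit(google)

# score_samples is higher for inliers, so negate it: higher = more anomalous.
raw = -forest.score_samples(google)
proba = (raw - raw.min()) / (raw.max() - raw.min())  # scaled to [0, 1]

outliers = google[proba > 0.75]
print(len(outliers))
```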
15. Let's practice!
Now, let's practice!