Handling outliers
In this exercise, you'll handle outliers - data points that are so different from the rest of your data, that you treat them differently from other "normal-looking" data points. You'll use the output from the previous exercise (percent change over time) to detect the outliers. First you will write a function that replaces outlier data points with the median value from the entire time series.
Cet exercice fait partie du cours
Machine Learning for Time Series Data in Python
Instructions
- Define a function that takes an input series and does the following:
- Calculates the absolute value of each datapoint's distance from the series mean, then creates a boolean mask for datapoints that are three times the standard deviation from the mean.
- Use this boolean mask to replace the outliers with the median of the entire series.
- Apply this function to your data and visualize the results using the given code.
Exercice interactif pratique
Essayez cet exercice en complétant cet exemple de code.
def replace_outliers(series):
# Calculate the absolute difference of each timepoint from the series mean
absolute_differences_from_mean = np.abs(series - np.mean(series))
# Calculate a mask for the differences that are > 3 standard deviations from zero
this_mask = absolute_differences_from_mean > (np.____(series) * ____)
# Replace these values with the median accross the data
series[this_mask] = np.____(series)
return series
# Apply your preprocessing function to the timeseries and plot the results
prices_perc = prices_perc.____
prices_perc.loc["2014":"2015"].plot()
plt.show()