Outliers
1. Outliers
Let's now see how to detect outliers and trends.
2. Outliers
Many datasets suffer from outliers, which are data points that lie far outside the expected range. Reasons for outliers in IoT data include measurement errors, such as a cheap sensor that produces wrong readings from time to time; manipulation, if the sensor is publicly accessible; and extreme events, which are valid measurements that nevertheless fall outside the expected range, for example during a heavy storm.
3. Outliers
A common method to detect outliers is to use three times the standard deviation. Under this rule, every data point that lies more than three standard deviations from the mean is considered an outlier. First, we calculate the mean and the standard deviation. We then add and subtract three times the standard deviation from the mean to get the upper and lower limits, and assign these values to separate columns in the DataFrame to make plotting simpler.
4. Outlier plot
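The three-standard-deviation rule described above can be sketched in pandas as follows. This is a minimal illustration, not the course's own code: the synthetic data, the column names `temperature`, `mean`, `upper_limit` and `lower_limit`, and the two injected readings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings stand in for real sensor data (assumption).
rng = np.random.default_rng(seed=0)
index = pd.date_range("2024-01-01", periods=240, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temperature": rng.normal(20.0, 1.0, size=len(index))}, index=index)

# Inject two implausible readings so there is something to detect.
df.iloc[50, 0] = 45.0
df.iloc[120, 0] = -10.0

# Mean and standard deviation of the whole series.
mean = df["temperature"].mean()
std = df["temperature"].std()

# Store the mean and both limits as constant columns to simplify plotting.
df["mean"] = mean
df["upper_limit"] = mean + 3 * std
df["lower_limit"] = mean - 3 * std

# Rows outside the channel between the limits are flagged as outliers.
is_outlier = (df["temperature"] > df["upper_limit"]) | (df["temperature"] < df["lower_limit"])
outliers = df[is_outlier]
print(outliers["temperature"])
```

With matplotlib installed, `df[["temperature", "mean", "upper_limit", "lower_limit"]].plot()` would draw the series together with its mean and limits. Note that the outliers themselves inflate the standard deviation, so this rule can miss moderate outliers when a few extreme ones are present.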
If we plot this, we can see the time series in blue, the mean in orange, and the upper and lower limits in green and red respectively. We can also see multiple data points breaking out of the defined channel.
5. Autocorrelation
Autocorrelation refers to the correlation of a time series with a delayed version of itself. For example, an autocorrelation of lag 3 returns the correlation between a time series and its own values delayed by three time points. We can plot the autocorrelation with the statsmodels package. First, we import tsaplots from statsmodels.graphics. We then pass our series into plot_acf() and specify the number of lags we'd like to see. We use 50 lags; for hourly data, this gives us 50 hours, or 2 full days and 2 hours. For each lag from one to fifty hours, we get the correlation of the series with itself shifted by that lag. The y axis ranges from 1 to -1 and specifies the correlation. The x axis shows the lag, so each point is the correlation between an observation and the observation x points prior to it. The shaded area defines the confidence interval, so all points outside of the blue area can be considered statistically significant.
6. Autocorrelation
In the plot, we see that for hourly temperature data, we have a high autocorrelation of 0.8 at lag 24, or after 24 hours, and a negative autocorrelation of -0.3 at lag 12, or after 12 hours. The same repeats itself after 48 and 36 lags respectively. This is supported by the graphs we've seen in previous lessons, with the temperature being lower every night and rising throughout the day.
7. Let's practice!
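As a quick numeric companion to the autocorrelation slides, the sketch below computes the correlation at a few lags directly with pandas `Series.autocorr()`. The series is synthetic (a 24-hour sine cycle plus noise, an assumption made for the example), so the exact values will differ from the 0.8 and -0.3 quoted for the real temperature data; only the sign pattern at lags 12 and 24 is expected to match.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperature with a 24-hour cycle plus noise (assumption).
rng = np.random.default_rng(seed=1)
hours = np.arange(24 * 30)  # 30 days of hourly readings
daily_cycle = 5.0 * np.sin(2 * np.pi * hours / 24)
series = pd.Series(15.0 + daily_cycle + rng.normal(0.0, 1.0, size=hours.size))

# Correlation of the series with itself shifted by each lag (in hours).
acf = {lag: series.autocorr(lag=lag) for lag in (12, 24, 36, 48)}

for lag, value in acf.items():
    print(f"lag {lag:2d}: {value:+.2f}")
```

The plot_acf() call from the slides shows the same quantity for every lag up to the requested maximum, together with the shaded confidence band.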
And now, it's your turn to detect some outliers.