1. Imputing missing values
Welcome back!
2. Regular and irregular time series
As a reminder, a 'regular' time series is one with evenly spaced observations and no missing values.
Many real-world datasets do not conform to this standard; perhaps a sensor was out of order on a certain day, or measurements could only be taken on clear, sunny days. Whatever the case, we should also be prepared to work with 'irregular' time series.
To keep our data tidy and as regular as possible, there are two key approaches. The first, which we've already covered, is aggregation: it resamples data to a lower resolution by summarizing the values within each unit, like a monthly total of daily observations.
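As a quick refresher, here is a minimal sketch of that monthly-total idea. The series `daily` and its values are made up for illustration; the sketch assumes the zoo package is installed.

```r
library(zoo)

# Hypothetical daily series covering January and February 2023, one unit per day
dates <- seq(as.Date("2023-01-01"), as.Date("2023-02-28"), by = "day")
daily <- zoo(rep(1, length(dates)), order.by = dates)

# as.yearmon maps each date to its month, and sum totals the values in each month
monthly_total <- aggregate(daily, by = as.yearmon, FUN = sum)
monthly_total  # two values: 31 for January, 28 for February
```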
The other, which this lesson focuses on, is 'imputation', which fills in missing values using one of several methods. Let's dive in!
3. Imputation
Imputation is the process of replacing missing or erroneous data with substituted values, for example the average of neighboring values.
Let's look at an example.
Visually, the line appears broken and discontinuous; let's zoom in a bit further.
In this case, the best option would be to impute our data: we can try to fill in the missing values based on different criteria.
4. Imputing values with zoo
The zoo package has a great range of helper functions to facilitate the imputation process. These are the na-dot functions, which we'll look at in this lesson!
The most commonly used functions are: na-dot-fill, na-dot-locf, and na-dot-approx. Each function is designed for a different purpose, so it's crucial to understand which imputation method suits our data.
5. Determining missing values
To find the count of missing or NA values, we can call sum on the result of is-na, like so. is-na returns TRUE for every NA value in the time series, and because each TRUE counts as one, summing them gives us the total number of NAs.
Here, our dataset 'observations' has 23 NA values.
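The counting step can be sketched on a small series. `observations_demo` is a hypothetical stand-in with three NAs, not the lesson's 'observations' dataset.

```r
library(zoo)

# Hypothetical stand-in series with three missing values
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 8)
observations_demo <- zoo(c(4, NA, 7, 3, NA, NA, 5, 2), order.by = dates)

# is.na() flags each missing value with TRUE
is.na(observations_demo)

# Each TRUE counts as 1, so the sum is the number of NAs (3 here)
na_count <- sum(is.na(observations_demo))
na_count
```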
6. na.fill
For data where the missing observations are assumed to be some default value, the best function to use is na-dot-fill. Based on the distribution of the values, let's assume those NAs are supposed to be zero.
Let's look at a plot of observations to see if that would make sense.
It appears that the zero values are missing! Let's fill them in with na-dot-fill.
7. na.fill
na-dot-fill requires two arguments: the time series in "object", and the value to fill with in "fill".
The result is another time series, where all of the NA values have been replaced by the value in fill.
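Here is a minimal sketch of that call. The series `counts` is invented for illustration, and zero is assumed to be the right default value for it.

```r
library(zoo)

# Hypothetical series where missing entries are assumed to mean "zero"
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 8)
counts <- zoo(c(4, NA, 7, 3, NA, NA, 5, 2), order.by = dates)

# Replace every NA with the value given in 'fill'
filled <- na.fill(counts, fill = 0)
filled

# No NAs should remain after filling
sum(is.na(filled))
```

The index is unchanged; only the NA values are swapped for the fill value.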
8. na.locf
Let's look at the next na-dot function, na-dot-locf. LOCF stands for "Last Observation Carried Forward"; the function replaces each NA with the most recent non-NA value that precedes it.
LOCF is often used in surveys and studies, where a participant drops out at a certain point; the most recent non-NA value is used to fill in the missing observations.
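A small sketch of carrying the last observation forward follows. The series `scores` is hypothetical, standing in for a participant's responses with some missing entries.

```r
library(zoo)

# Hypothetical survey scores; the participant misses days 3, 4, and 6
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 6)
scores <- zoo(c(10, 12, NA, NA, 15, NA), order.by = dates)

# Each NA takes the most recent non-NA value before it
carried <- na.locf(scores)
carried  # values: 10 12 12 12 15 15
```

One thing to watch: an NA at the very start of a series has nothing before it to carry forward. By default na.locf drops such leading NAs; passing na.rm = FALSE keeps them as NA instead.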
9. Linear interpolation
The last function we'll look at is na-dot-approx, which works by using 'linear interpolation' to fill in missing values.
10. Linear interpolation
By connecting the non-NA values on either side of the missing observation with a straight line, linear interpolation can approximate the values at the missing points.
11. Linear interpolation
The result is a continuous change between non-NA observations!
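To make the straight-line idea concrete, here is a sketch that works out one missing point by hand and then compares it with na.approx. The series `temps` and its values are made up for illustration.

```r
library(zoo)

# Hypothetical daily temperatures with a two-day gap between 10 and 16
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 5)
temps <- zoo(c(10, NA, NA, 16, 18), order.by = dates)

# By hand: 2 January sits one third of the way from 1 January to 4 January,
# so its value is one third of the way from 10 to 16
manual_jan2 <- 10 + (1 / 3) * (16 - 10)

# na.approx does the same for every NA between two known values
interpolated <- na.approx(temps)
interpolated  # values: 10 12 14 16 18
```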
12. na.approx
For time series that track a continuous variable, like temperature, precipitation, or price, na-dot-approx is often the best choice; when only a small number of values are missing, linear interpolation can give a reasonable approximation of the absent observations.
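As a hedged sketch of two details worth knowing: na.approx interpolates along the time index, so uneven spacing is taken into account, and its maxgap argument can leave long runs of NAs untouched. The series `price` and `long_gap` are hypothetical.

```r
library(zoo)

# Irregularly spaced readings: the index jumps from 2 January to 5 January
irregular_dates <- as.Date(c("2023-01-01", "2023-01-02", "2023-01-05", "2023-01-06"))
price <- zoo(c(10, NA, 18, 20), order.by = irregular_dates)

# 2 January is one of four days along the way from 10 to 18, so it becomes 12
price_filled <- na.approx(price)
price_filled

# A series with one long gap (three NAs) and one short gap (one NA)
gap_dates <- seq(as.Date("2023-02-01"), by = "day", length.out = 7)
long_gap <- zoo(c(1, NA, NA, NA, 5, NA, 7), order.by = gap_dates)

# maxgap = 2 fills runs of at most two NAs and leaves longer runs as NA
capped <- na.approx(long_gap, maxgap = 2)
capped
```

Capping the gap length is one way to avoid drawing a long straight line across a stretch where we know very little about the data.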
13. Let's practice!
Head to the exercises and practice imputing missing values!