Imputing time-series data

1. Imputing time-series data

In this lesson, you'll learn to impute missing values in time-series data. Because the observations are ordered in time, time-series data calls for a different approach from the one used for other data.

2. Airquality Dataset

We'll use the 'airquality' dataset, a time-series dataset that measures air quality over time. Before we jump into the imputation techniques, let's explore the dataset with 'airquality.head()'. This dataset has four columns.

3. Airquality Dataset

We can find the number of missing values in each column using 'airquality.isnull().sum()'. Checking the missingness percentage in the same way, you'll find that Ozone has a whopping 24% of its values missing! We'll now explore several techniques that can impute these missing values as accurately as possible.
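As a minimal sketch of the two checks above, the snippet below counts missing values and converts them to percentages. The real 'airquality' file is not loaded here; the small DataFrame, its dates and the 'Temp' column are made-up stand-ins, so the numbers will not match the 24% quoted in the lesson.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lesson's DataFrame (values, dates and the
# "Temp" column are illustrative assumptions, not the real data).
airquality = pd.DataFrame(
    {
        "Ozone": [37.0, np.nan, np.nan, 29.0, np.nan, 71.0],
        "Temp": [79, 76, 78, 74, 67, 84],
    },
    index=pd.date_range("1976-06-01", periods=6, freq="D"),
)

# Number of missing values per column.
missing_count = airquality.isnull().sum()

# Share of missing values per column, as a percentage:
# the mean of a boolean mask is the fraction of True values.
missing_pct = airquality.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```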

4. The .fillna() method

We can use the '.fillna()' method to impute missing values in a time-series DataFrame. 'fillna()' has two strategies, 'ffill' (or 'pad') and 'bfill' (or 'backfill'), which are selected using the 'method' argument. Let's understand them in detail.

5. Ffill method

When we set 'method' to "ffill", every 'NaN' is replaced with the last observed value before it. The call is 'airquality.fillna(method="ffill")'. Let's understand this more clearly!

6. Ffill method

Checking for 'NaN's from the 30th to the 40th row in the 'Ozone' column of the 'airquality' dataset shows that there are 6 consecutive missing values, from the 31st to the 36th row. Once we apply 'fillna()' with 'method' set to 'ffill', you can see that all of these 'NaN's are filled with the last observed value, 37. The same happens to the 'NaN' after the value 29, which is filled with 29.
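The forward fill described above can be sketched on a toy series that mimics the pattern from the slide (37, six 'NaN's, 29, one 'NaN', 71). The series is an illustrative stand-in, not the real 'Ozone' column. The sketch uses the '.ffill()' shortcut, which does the same job as 'fillna(method="ffill")'; recent pandas releases deprecate the 'method' argument, so the shortcut is the safer spelling.

```python
import numpy as np
import pandas as pd

# Toy series mimicking the pattern on the slide:
# 37, six NaNs, 29, one NaN, 71 (not the real Ozone column).
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0], name="Ozone")

# Forward fill: each NaN takes the last observed value before it.
ozone_ffill = ozone.ffill()

print(ozone_ffill.tolist())
```

Note that a 'NaN' at the very start of a series has no earlier value to copy, so forward fill leaves it missing.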

7. Bfill method

In contrast to the 'ffill' strategy, 'bfill' replaces each 'NaN' with the next observed value that comes after it. Note that 'backfill' is the same as 'bfill'.
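A matching sketch for backward fill, again on an illustrative toy series rather than the real data. It uses '.bfill()', the shortcut for 'fillna(method="bfill")'.

```python
import numpy as np
import pandas as pd

# Same toy pattern as before: 37, six NaNs, 29, one NaN, 71.
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0], name="Ozone")

# Backward fill: each NaN takes the next observed value after it.
ozone_bfill = ozone.bfill()

print(ozone_bfill.tolist())
```

Here the six consecutive 'NaN's become 29 and the single 'NaN' becomes 71. A 'NaN' at the very end of a series would stay missing, because there is no later value to copy.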

8. Bfill method

Imputing the missing values in 'airquality' with the 'bfill' method shows that the 'NaN's are filled with 29, which is the next observed value after the 'NaN's. While the 'fillna()' method can also be used for imputing non-time-series data,

9. The .interpolate() method

the 'interpolate()' method is particularly well suited to imputing time-series data. This method offers more complex strategies that use the pattern in the non-missing values to estimate the missing ones. In this lesson, we are going to explore the 'linear', 'quadratic' and 'nearest' strategies.

10. Linear interpolation

The 'linear' method imputes the missing values by interpolating along a straight line between the last observed value before the 'NaN's and the next observed value after them. Let's understand this clearly using the 'airquality' DataFrame.

11. Linear interpolation

When imputing missing values of the 'Ozone' column using the linear method, you'll observe that the missing values between 37.0 and 29.0 decrease in equal steps of about 1.1 to 1.2 units, going from 37 down to 29. Similarly, the missing value between 29 and 71 is imputed as 50, which is 21 units from each of the two surrounding values.
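The arithmetic above can be checked with a short sketch on a toy series that copies the pattern from the slide (an illustrative stand-in, not the real column). Six 'NaN's between 37 and 29 leave seven equal gaps, so each step is 8 / 7, roughly 1.14.

```python
import numpy as np
import pandas as pd

# Toy pattern from the slide: 37, six NaNs, 29, one NaN, 71.
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0], name="Ozone")

# Linear interpolation: missing points lie on the straight line
# joining the observed values on either side of the gap.
ozone_linear = ozone.interpolate(method="linear")

# Step size inside the first gap: (37 - 29) spread over 7 intervals.
step = (37.0 - 29.0) / 7

print(ozone_linear.round(2).tolist())
print(round(step, 3))
```

By default, 'linear' treats the rows as equally spaced and ignores the index values, which suits regularly sampled data like this example.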

12. Quadratic interpolation

With the 'quadratic' method, the missing values are imputed along a quadratic (second-order) curve fitted to the observed values, rather than along a straight line.

13. Quadratic interpolation

In the 'airquality' dataset, the values imputed with 'quadratic' interpolation follow a parabolic trajectory that dips in the negative direction and then shoots back up to a positive value. Note, however, that such values are highly unlikely for this data!
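A minimal sketch of quadratic interpolation follows, on a made-up series with a few extra observed points at each end so the curve has something to fit. pandas delegates this method to SciPy, so SciPy must be installed. How far the curve swings depends entirely on the data, so the sketch prints the result rather than assuming a particular shape.

```python
import numpy as np
import pandas as pd

# Illustrative series (not the real Ozone column): observed values
# around a six-value gap and a one-value gap.
ozone = pd.Series(
    [41.0, 36.0, 37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0, 39.0],
    name="Ozone",
)

# Quadratic interpolation: requires SciPy to be installed.
ozone_quad = ozone.interpolate(method="quadratic")

# Inspect the imputed values, including the lowest one, to judge
# whether the curve produces implausible numbers for this data.
print(ozone_quad.round(2).tolist())
print(ozone_quad.min())
```

Checking the minimum (and maximum) of the result is a quick way to spot the kind of overshoot described above before trusting the imputed values.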

14. Nearest value imputation

Nearest value imputation, on the other hand, behaves like a combination of 'ffill' and 'bfill': each missing value is imputed with whichever observed value is closest to it.

15. Nearest value imputation

You can observe in the 'airquality' DataFrame that, of the 6 consecutive missing values between the 30th and 40th rows, the first 3 are filled with 37, the nearest value before them, while the next 3 are filled with 29, the nearest value after them.
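The split described above can be reproduced with a small sketch on a toy series (37, six 'NaN's, 29, 71), which is an illustrative stand-in for the real rows. Like 'quadratic', the 'nearest' method relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Toy pattern: 37, six NaNs, 29, 71 (not the real Ozone column).
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, 71.0], name="Ozone")

# Nearest-value interpolation: each NaN copies the observed value
# that is closest to it by position. Requires SciPy.
ozone_nearest = ozone.interpolate(method="nearest")

print(ozone_nearest.tolist())
```

The first three 'NaN's sit closer to 37 and the last three closer to 29, so the gap is split down the middle.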

16. Let's practice!

It's now time to practice!
