Clean Data

1. Clean Data

Welcome back. Let's now discuss a topic that affects most datasets: missing values.

2. Missing data

Since IoT data is often gathered by small, battery-powered devices in remote locations, data collection may not always be stable. There could be interruptions in the network service, the battery could run out, or other factors outside our control could interfere. This means that we might have no data for a few hours or days, or that some measurements are incomplete, with values missing in one or more columns. When working with data streams, we have more than one opportunity to deal with missing or corrupted data. We can either take care of erroneous values right in the stream, which is often required if we apply an algorithm in real time and which can also save disk space, or store the measurements anyway and clean the data as part of the analysis.

3. Dealing with missing data

Depending on how much data is missing, and in which column, we have different methods at our disposal. If only a few observations are missing, we can fill them with the series mean or median. pandas also offers alternatives such as forward-fill and backward-fill, each with its own benefits and drawbacks. We can also drop observations with missing values, but this should be considered carefully, since it reduces the amount of data available for later analysis and machine learning models. Ultimately, if too many key values are missing, we might even need to stop analyzing the dataset, since handling the missing data would remove most of the insights.
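As a minimal sketch of the mean and median fill, assuming a hypothetical temperature column with invented readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the column name and values are invented for illustration.
df = pd.DataFrame({"temperature": [18.0, np.nan, 20.0, 25.0, np.nan]})

# mean() and median() skip NaN by default, so they are computed
# from the three observed values only.
mean_filled = df["temperature"].fillna(df["temperature"].mean())
median_filled = df["temperature"].fillna(df["temperature"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Here the mean and the median differ because of the single high reading, which is one reason the median is often the safer choice when a series contains outliers.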

4. Detecting missing values

df.info() can help us identify missing values. The dataset has 12 entries. Precipitation has 12 non-null values, so it seems to not be missing any data. However, temperature and humidity only have 8 non-null entries, so it looks like they both have 4 missing values.
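The dataset shown on the slide is not part of this transcript, so the sketch below rebuilds a comparable one with invented values: 12 rows, a complete precipitation column, and 4 gaps each in temperature and humidity. The column names, timestamps, and readings are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical 12-row dataset with a 10-minute interval (values invented).
idx = pd.date_range("2018-10-01 00:00", periods=12, freq="10min")
nan = np.nan
df = pd.DataFrame(
    {
        "precipitation": [0.0] * 12,
        "temperature": [18.0, nan, 18.4, nan, 18.9, 19.1, nan, 19.6, 19.8, nan, 20.1, 20.3],
        "humidity": [71.0, nan, 70.5, nan, 70.1, 69.8, nan, 69.2, 69.0, nan, 68.6, 68.3],
    },
    index=idx,
)

# info() prints the number of non-null entries per column.
df.info()

# count() returns the same non-null counts as a Series, which is easier
# to use in further calculations.
non_null = df.count()
missing = len(df) - non_null
print(missing)
```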

5. Drop missing values

This is confirmed by printing the first 5 rows to the screen. We can drop missing values in pandas by using df.dropna(). By default, this removes every row that contains at least one NaN value. As we can see, the 2nd and 4th rows have been removed. This also reduces the amount of data we have for the precipitation column, even though that column had no missing values.
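A small sketch of that behavior, assuming five invented rows in which the 2nd and 4th are incomplete:

```python
import numpy as np
import pandas as pd

# Hypothetical first five rows; the 2nd and 4th rows are incomplete.
df = pd.DataFrame(
    {
        "precipitation": [0.0, 0.1, 0.0, 0.2, 0.0],
        "temperature": [18.0, np.nan, 18.4, np.nan, 18.9],
        "humidity": [71.0, np.nan, 70.5, np.nan, 70.1],
    }
)

# Default behavior: drop every row that contains at least one NaN.
cleaned = df.dropna()
print(cleaned)

# subset= restricts the check to the listed columns only.
cleaned_on_temperature = df.dropna(subset=["temperature"])
```

Note that dropna() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a variable.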

6. Fill missing values

We may not always be allowed to drop observations from a dataset. An alternative that is often better is to fill the missing values. By filling the gaps with something that makes sense in the context of the series, we don't lose any information from the columns that have no missing observations. If our method to fill NaN values is forward-fill, the last non-null value is propagated forward, so values from row 1 are copied to row 2. We can achieve this by calling fillna() with the argument method="ffill". Backward-fill works the other way around: it takes the next row within the series instead of the previous row to fill missing values.
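A short sketch of both directions on an invented series. It uses the .ffill() and .bfill() methods rather than the fillna() call named in the narration, because recent pandas releases deprecate the method argument of fillna() in favor of these two methods; the result is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle and one at the end.
s = pd.Series([1.0, np.nan, 3.0, np.nan], name="temperature")

# Forward-fill: each gap takes the last observed value before it.
forward = s.ffill()

# Backward-fill: each gap takes the next observed value after it.
# The trailing gap stays NaN because nothing follows it.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

The leftover NaN in the backward-filled series shows one of the drawbacks: a fill method can only work when there is a neighboring value in the chosen direction.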

7. Interrupted Measurement

Another common task is to detect whether the data collection was interrupted. While .info() shows us that measurements are missing in one column, it does not tell us when an hour or a day is missing from all columns, because those rows simply do not exist. We can first verify that the rows we do have contain no missing values by using the .isna() method, which returns True wherever the content is NaN. By taking the sum of the result, we get the count of missing values, since True is treated as 1. Our dataset, which has an interval of 10 minutes, seems to have no missing values. Let's now resample the data with the expected interval of 10 minutes. The first few rows did not change. However, we now have 34 intervals without any data between the first and the last date in the DataFrame, so there was at least one period without any observations.
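The check can be sketched as follows. The data is invented, with a one-hour gap instead of the 34 intervals mentioned above, and .last() is an assumption about the aggregation, since the narration does not say which one is used; any aggregation that returns NaN for an empty interval would serve here.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-minute readings with a one-hour interruption:
# 00:00-00:50 and 02:00-02:50 are present, 01:00-01:50 is missing entirely.
first = pd.date_range("2018-10-08 00:00", periods=6, freq="10min")
second = pd.date_range("2018-10-08 02:00", periods=6, freq="10min")
df = pd.DataFrame(
    {"temperature": np.linspace(18.0, 20.2, 12)},
    index=first.append(second),
)

# No NaN in the stored rows, so the gap is invisible here.
missing_before = int(df["temperature"].isna().sum())

# Resampling to the expected interval creates a row for every 10-minute
# slot between the first and last timestamp; empty slots become NaN.
resampled = df.resample("10min").last()
missing_after = int(resampled["temperature"].isna().sum())

print(missing_before, missing_after)
```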

8. Interrupted Measurement

We can visualize the missing time period with matplotlib. There are no values between the 8th and the 9th of October, which confirms our finding that there was an interval without successful data collection.
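A plot of this kind can be sketched as below. The series is synthetic, and the year, the exact position of the gap, and the output filename are assumptions made for illustration. matplotlib does not connect points across NaN values, so the interruption shows up as a break in the line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic 10-minute series, already resampled so the gap holds NaN values.
idx = pd.date_range("2018-10-07", "2018-10-10", freq="10min")
series = pd.Series(20 + np.sin(np.arange(len(idx)) / 20), index=idx, name="temperature")

# Hypothetical interruption on the 8th of October (34 intervals of 10 minutes).
gap_start = pd.Timestamp("2018-10-08 12:00")
gap_end = pd.Timestamp("2018-10-08 17:30")
series.loc[gap_start:gap_end] = np.nan

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(series.index, series.values)
ax.set_xlabel("time")
ax.set_ylabel("temperature")
fig.autofmt_xdate()
fig.savefig("missing_period.png")  # hypothetical output filename
```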

9. Let's practice!

And now, let's practice.
