Prepare and visualize incremental data

1. Prepare and visualize incremental data

Welcome back.

2. Data preparation

In this lesson, we will prepare and analyze the data gathered in the previous lesson. We will pivot and downsample the data to a longer timeframe, calculate the difference between two consecutive observations, and calculate the percent-change over time.

3. Data structure

If we have a look at the data, we can see that we have 3 columns: timestamp, device, and values. This data-format is difficult to work with, as it does not allow for easy comparison between devices. We should have one column per device, each containing data for its respective device.

4. Pivot table

We can accomplish this by using pd.pivot_table(), which converts rows into columns and is similar to pivot tables in spreadsheets. We use "device" as columns argument so each unique value in the device column will become its own column in the new DataFrame. We specify "timestamp" as index, which will be the index of the new DataFrame. We also specify "value" as value argument, which will represent the values of each cell.

5. Apply pivot table

We will now apply pivot_table() to the data to have each device in its own column. Notice that we generated a lot of NAN values. This happens because the events of both devices are separated by sub-seconds.

6. Resample

To fix this mismatch, we will downsample the DataFrame to 1 minute bins. Resampling will put the data into bins of the given interval and apply an aggregation function to the data. We use this to remove noise in data, and to reduce the amount of data. Resampling only works with timeseries DataFrames, which have the index of type DatetimeIndex. We need to specify an aggregation function like min(), sum(), max() or mean(). In our case, we'll use max() because we're working with incremental data, which means that the latest value in each bin will also be the highest. This produces many empty rows because resample creates exactly one row per minute. We have multiple observations per minute, however we also have some longer periods of missing data. Therefore we append a call to dropna(), so rows without a valid entry are dropped.

7. Visualize data

Now the data is ready to be analyzed further, so let's plot the data. Notice that the plot is not very informative, showing two straight lines. Both devices have different scaling and increment almost linearly. This is because the counter gives a running total. What we really care about is how much energy was consumed in a certain period of time.

8. pd.diff()

We can accomplish this by taking the difference between 2 consecutive measurements. This can be done by using data.diff(). We specify a number of periods the data should be shifted before subtracting. We'll stick to 1 here, as we are interested in the difference between 2 consecutive observations.

9. Data analysis - difference

Looking at the plot now, we see a fairly noisy plot with a few huge outliers. These are not real consumption spikes, but points where the data collection was interrupted for a few minutes. In this case, using a bigger resample interval (longer than the interruption) removes the outliers and results in a much smoother plot.

10. Change percentage

Plotting the difference shows us the total consumption between two intervals. It's non-obvious if an absolute drop of 10 watt-hours between 2 intervals means a 1% or 20% drop in consumption. By applying df.pct_change() to the differential DataFrame, we can calculate the change-percent between consecutive rows. This plot allows us to compare the devices.

11. Let's Practice

Now, lets practice.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.