Get startedGet started for free

Combining datasources for further analysis

1. Combining datasources for further analysis

Welcome to this chapter. Now that you know how to access, load, explore and prepare IoT data, you are ready to analyze it!

2. Combining data sources

While working with IoT Data, we will often want to combine multiple datasets. Here, we have the temperature in degree Celsius and sunshine duration in seconds for the same time-period in separate DataFrames. Each dataset is loaded into a different variable and consists of observations with a timestamp as index and a measurement. Notice that both datasets have a different frequency. The temperature data is hourly, while the sunshine data has two data points each hour. To analyze this data, we should combine it into one DataFrame with multiple columns.

3. Naming columns

Columns in both DataFrames were called "values", so let's start by correctly naming the column for each DataFrame. We do this by assigning the name to the DataFrame.columns property. This needs to be a list, and the length of the list needs to match the amount of columns in the DataFrame. This is an important step since it makes sure that we can easily identify our measurements later.

4. Concat

After having the columns in our datasets renamed, we can combine the datasets by using pandas .concat() function. .concat() takes a list of DataFrames, and returns a new, combined DataFrame. By default, .concat() appends DataFrames below each other, so we need to specify axis=1 to make sure to combine the data along the column axis. In the output, we can see the combined DataFrames, with both columns combined into one DataFrame. Every 2nd entry in the temperature column is NaN, since temperature did not have any values at the half-hour marks. While we could use dropna() or fillna(), we'll use resample() to get rid of the NaN values.

5. Resample

We'll resample the DataFrame to 1h bins, and use .agg() as aggregation method. agg() takes a dictionary with column-names as keys, and the function to execute as values. We use max() for the temperature, and sum() for the sunshine column, since sunshine is a counter variable, containing the number of seconds. While this reduces the total amount of data points, we don't lose much information, since the sunshine per hour remains the same before and after resampling. We'll then execute the resample method to 1 hour and apply the method .agg() with the agg_dict we just defined. The output's temperature values remain the same, while sunshine was aggregated to the hourly interval.

6. Fillna

An alternative method is to use fillna(), with the method forward-fill. This method allows us to keep the higher frequency of the data, half hour intervals in this case, while also eliminating the NaN values. For each missing value, the measurement of the previous row is copied to the row with missing values. The sunshine data is kept untouched, since all values are available.

7. Let's practice!

And now, it's your turn to put this into practice!