Merging and aggregating data

1. Merging and aggregating data

Hello. You've cleaned the data and it's consistent and useful. You now need to combine and summarize it to get the answers you need.

2. Back to the plan

Now for the next process steps. First, combine the cleaned data into a single table. Next is aggregation - summarizing the data in different ways to be able to give the results the business case requires.

3. All in one

The nodes you'll use most often for combining data tables are the Value Lookup, Concatenate, and Joiner nodes. The Concatenate node is used where the tables are very similar, with the same columns available. The Joiner node is used when you're matching with a key between the two tables, resulting in additional columns. It's always good practice to add a deduplicating node at the end of the merging process. As ever - don't forget to annotate.

4. Summarizing it all

For aggregating data, the nodes you'll use most often are the Row Aggregator, GroupBy, and Pivot nodes in order of complexity. It's worth using the simplest node that will achieve your desired result to avoid potential errors. Make sure that you check the output carefully to confirm that it is the output that you are expecting.

5. Using the right aggregations

How do you know which aggregations to use? These will be set by the business case and usually there will be multiple aggregations needed. Read the business case carefully when you're designing your workflow.

6. Using the right aggregations

Here is some guidance for the most used nodes, including links to the relevant pages on the KNIME hub describing the nodes in detail, starting with the Row Aggregator. This is used in the simplest cases, where you need to aggregate over the whole dataset in one shot. It can also be used in the most complicated cases if you need to make advanced calculations on a subset of the data and then aggregate it.

7. Using the right aggregations

Next is the Groupby node - which is very common. This is used if you need to aggregate by categories based on repeated values in string columns. There will be one row per category.

8. Using the right aggregations

I'll finish with the commonly used Pivot node. This adds an extra layer of complexity - in addition to the grouping, the pivot transforms values into additional aggregated columns.

9. Using the right aggregations

When building the workflow, don't be surprised if your first results don't match what you are looking for - this is not unusual. You will probably need to experiment to achieve good answers. At this stage, check the results with the business case owner and adapt if necessary.

10. Let's practice!

Time to merge and aggregate the data; your last step before completing the business case with viewing and exporting - let's go!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.