Data cleaning and transformation
1. Data cleaning and transformation
Hello, Andrew here. Now that you have a plan and the data, it's time to clean the data and get it into the right shape to answer the business case.
2. Sticking to the plan
Following the process, you need to make the raw data clean, consistent, and useful. You start with cleaning, which ensures the data is unique, consistent, and free of avoidable errors. Next is transformation, which creates useful columns that help answer the business case.
3. Sparkling clean
The most common nodes used for cleaning are the Missing Value, Duplicate Row Filter, and Column Filter nodes. The String Cleaner node is used both for cleaning (e.g., removing white space and punctuation) and for transformation (e.g., making case consistent). As always, remember to annotate.
4. Sparkling clean
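KNIME itself is visual, but if you think in code, the four cleaning steps just mentioned can be sketched in plain Python. This is only an analogy, not how the KNIME nodes are implemented; the sample rows, the column names, and the `clean` helper are all hypothetical.

```python
import string

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"incident_id": "A1", "borough": "  Camden ", "cost": "250", "notes": "x"},
    {"incident_id": "A1", "borough": "  Camden ", "cost": "250", "notes": "x"},
    {"incident_id": "A2", "borough": "ealing!", "cost": "", "notes": "y"},
]


def clean(rows, keep_columns, default_cost="0"):
    """Apply four cleaning steps, loosely mirroring the nodes named above."""
    cleaned = []
    seen = set()
    for row in rows:
        # Column filter: keep only the columns we need.
        row = {name: row[name] for name in keep_columns}
        # String cleaner: trim white space, strip surrounding punctuation,
        # and make the case consistent.
        row["borough"] = row["borough"].strip().strip(string.punctuation).title()
        # Missing value: replace an empty cost with a fixed default.
        if not row["cost"]:
            row["cost"] = default_cost
        # Duplicate row filter: skip rows we have already seen.
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


result = clean(rows, keep_columns=["incident_id", "borough", "cost"])
```

Here `result` holds two rows: the duplicate `A1` row is dropped, the `notes` column is filtered out, and the empty cost for `A2` is filled with the default.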
In a real-world case, you will often need to do something special. Don't worry if you don't know all the nodes: use the KNIME forum or your favorite search engine to find out how others have solved the problem in KNIME, and re-use their approach! This applies anywhere in the workflow.
5. Making useful changes
You'll need to manipulate the cleaned data so you can answer the business case; in other words, you make the data useful. This usually includes string manipulation, which covers many different possible operations. Here, we see one possible node arrangement.
6. Making useful changes
You may have noticed that the string manipulation and numerical calculation examples use the Expression node as well as the String Manipulation and Math Formula nodes. These show that KNIME offers multiple ways of achieving the same result; often, there is more than one right answer.
7. Making useful changes
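The "more than one right answer" idea is not specific to KNIME. As a rough illustration in plain Python, here are two routes to the same string result and two routes to the same numeric result. The function names and sample values are hypothetical and do not correspond to any KNIME node or expression syntax.

```python
def label_by_concatenation(borough, ward):
    """Build an upper-case label by concatenating the parts."""
    return borough.upper() + " - " + ward.upper()


def label_by_join(borough, ward):
    """Build the same label by joining a sequence of parts."""
    return " - ".join(part.upper() for part in (borough, ward))


def minutes_by_division(seconds):
    """Convert seconds to minutes with a single division."""
    return round(seconds / 60, 1)


def minutes_by_divmod(seconds):
    """Convert seconds to minutes via whole minutes plus the remainder."""
    whole_minutes, remainder = divmod(seconds, 60)
    return round(whole_minutes + remainder / 60, 1)


same_label = label_by_concatenation("camden", "holborn") == label_by_join("camden", "holborn")
same_minutes = minutes_by_division(325) == minutes_by_divmod(325)
```

Both comparisons evaluate to `True`: the routes differ, the results do not. Which one you pick is mostly a matter of readability for whoever maintains the workflow next.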
Other transformations include renaming columns, among many other operations. As always, annotate your workflow to ensure clarity and re-usability!
8. Easier to understand and re-use
Remember that in the last chapter I mentioned the idea of metanodes as towns on the map? It's time to explore this further.
9. Easier to understand and re-use
At this point, your workflow will start to look busy, with many nodes and often with multiple paths that do similar things. This is a good time to start using metanodes, which are multiple nodes packaged into a single node. The GIF shows how the metanode process works.
10. Easier to understand and re-use
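If it helps to think of this in code terms, packaging several nodes into a metanode is loosely comparable to wrapping several small steps in one reusable function. The sketch below is only an analogy under that assumption; `make_metanode`, `trim`, and `normalise_case` are hypothetical names, and none of this reflects how KNIME builds metanodes internally.

```python
def trim(value):
    """One small step: remove surrounding white space."""
    return value.strip()


def normalise_case(value):
    """Another small step: make the case consistent."""
    return value.title()


def make_metanode(*steps):
    """Package several steps into a single callable, applied in order."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# One packaged unit, re-used on two separate paths of the workflow.
clean_text = make_metanode(trim, normalise_case)
path_one = [clean_text(v) for v in ["  camden ", "EALING"]]
path_two = [clean_text(v) for v in [" tower hamlets"]]
```

The point of the analogy is the re-use: `clean_text` is defined once and applied on both paths, much like copying a single metanode instead of every node inside it.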
Creating and using metanodes helps you save time, because you copy one metanode rather than multiple nodes. Metanodes also make your workflow easier to understand, since it becomes less busy and easier to follow. You'll also save time on annotations by adapting existing ones where necessary rather than starting from the beginning with each node.
11. Let's practice!
I'm sure you're excited to start cleaning and transforming the London Fire Brigade data - let's go!