Get startedGet started for free

Cleaning data in KNIME Analytics Platform

1. Cleaning data in KNIME Analytics Platform

Let’s see some of the nodes of KNIME Analytics Platform to clean data and bring them to the desired shape. In a previous exercise, you have already met the Row Filter node, which filters the rows of a table according to a specified condition. Let’s see it more in depth. In the configuration window, you can select one or more columns to test, set an operator and a value. For example, filter all the rows where the Age is above 40 and the Number of Companies Worked is exactly one. You can also reverse the filter, and exclude the rows that match the criteria. If you want to make sure that a table only contains unique rows, use the Duplicate Row Filter node. You can test all the columns or search for duplicates in just a subset of them. By default, the node removes the rows that have the same values in all the included columns. You can also configure it to get a summary for each row, indicating if the row is unique or duplicate. Sometimes a table contains more columns that you need. You can get rid of them with the Column Filter node. Just select the columns that you want to exclude and move them to the left side of the menu. The question marks in a table represent missing values. You can take care of them with the Missing Value node. Replace it with a fixed value, a statistic from the column or even remove all the rows that contain missing values. This choice can be done for all the columns of a certain type, like all the string or numeric columns, or you can define a different replacement for a specific column in the Column Settings tab. This table does not contain the value “distance from home” for all the employees. You can fill the missing ones with the column mean. Recall that you can see the number of missing values in each column in the Statistics tab of the Node Monitor. The node monitor shows the column type under its name. If you notice that some of the columns have the wrong type, you can convert it with one of the converter nodes. For example, use the Number to String node to convert the type of one or more numerical columns. Finally, meet the powerful String Cleaner node. This node offers a variety of options to clean string values, like removing whitespace, removing punctuation, converting to lowercase and so on. Select the columns that you want to treat, and then enable the desired options. For example, make sure that the strings don’t contain non-ASCII characters, are free of numbers and whitespaces and are all capitalized. You can also add padding before or after the string, to make sure that they have a minimum length. You are now fully equipped to make your data shine. Try it out in the exercise!

2. Let's practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.