Get startedGet started for free

Column cleaning and standardization

1. Column cleaning and standardization

Let's examine how to transform tabular data to improve data quality.

2. Book sales dataset

We'll continue to use our dataset of bestselling books.

3. Examining data types

Before cleaning data, we need to understand how each column is stored. Using Tablesaw's `Table.read().csv()`, we load our bestsellers dataset. We then examine column types using a loop: `books.columnNames()` gets all column names, and `books.column().type()` reveals how each column is stored.

4. Examining data types: outputs

The output shows text stored as STRING, years as INTEGER, and prices as DOUBLE. Notice that sales figures are stored as INTEGER since they're whole numbers - this might need conversion if we later need decimal precision.

5. Converting numeric formats

One way to clean data is to derive meaningful metrics from existing columns. We first need compatible numeric types. Since Tablesaw inferred `Sales_in_Millions` as an integer, we use `intColumn().asDoubleColumn()` to convert it to a double, matching our `Average_Price` column's type. Note that it is possible to convert a DoubleColumn to an IntColumn using `.asIntColumn()`. Next, we'll use these properly-typed columns to calculate total revenue.

6. Multiplying columns

To calculate total revenue, we multiply sales by average price. `DoubleColumn` enables column-wise math operations with `.multiply()`. After naming our new column with `.setName()`, we add it to the table using `.addColumns()` and print the results.

7. Multiplying columns: outputs

The output shows the calculated revenue for each book.

8. Regex to match parentheses

Special characters and annotations often need cleaning. Let's expand our regex knowledge by looking at an example of removing text in parentheses using regex. The regex pattern has three parts: `\\(` matches an opening parenthesis, `.*` matches any characters in between, and `\\)` matches the closing parenthesis. This pattern will match and remove text in parentheses. Now let's see how to apply this pattern using `.replaceAll()`.

9. Handling special characters

The `.map()` method transforms each title in our column. Within the map, we use `.replaceAll()` to remove parenthetical translations and `.trim()` to clean up spaces. The output shows an example title with the parentheses removed.

10. Combining string operations

Often we need to apply multiple cleaning steps to text data. Here, we chain three operations on our book titles: removing parenthetical translations with `.replaceAll()`, removing extra spaces with `.trim()`, and converting to lowercase. The `.map()` method lets us combine these operations efficiently. The output shows how the book title is transformed after all cleaning steps.

11. Putting it all together

We've explored several key techniques for cleaning tabular data. We start by checking data types with `.type()` and converting them using methods like `.asDoubleColumn()`. For numeric data, we can perform calculations like multiplying columns. Text cleaning uses `.map()` with either method references (like `String::toLowerCase`) for simple operations or lambda expressions (like `.replaceAll()`) when we need parameters like regex to remove special characters. These operations help prepare our data for reliable analysis.

12. Let's practice!

Now you can practice transforming tabular data!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.