String normalization

1. String normalization

String normalization converts text values that mean the same thing into one consistent form, and it is a key step in transforming data.

2. Grocery inventory dataset

Throughout the chapter, we'll work with subsets of a grocery inventory dataset. We may modify certain columns like the product name to illustrate data transformations.

3. Why string normalization matters

Let's assume that we've extracted some items from the product name column into a list called `messyProducts`. Messy strings can prevent programs from recognizing that two text values refer to the same thing. Our product data shows the following common problems: trailing spaces in "Eggplant ", all capital letters in "VEGETABLE OIL" versus mixed case elsewhere, special characters in "Fresh (Organic) *Carrots*", and multiple spaces in "Bell Pepper Fresh". While humans recognize that "Eggplant" with and without a trailing space is the same item, a computer compares strings character by character and treats them as different. Let's clean each type of messiness by normalizing strings.
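To make the problem concrete, here is a minimal sketch of how exact string comparison treats these values. The class name, the `sameProduct` helper, and the exact number of extra spaces in the sample data are assumptions made for illustration; they are not taken from the course dataset.

```java
import java.util.Arrays;
import java.util.List;

class MessyComparison {
    // Hypothetical sample values modeled on the examples in the text.
    // The number of extra spaces in the last entry is an assumption.
    static final List<String> MESSY_PRODUCTS = Arrays.asList(
        "Eggplant ",
        "VEGETABLE OIL",
        "Fresh (Organic) *Carrots*",
        "Bell  Pepper   Fresh"
    );

    // Hypothetical helper: exact, character-by-character comparison.
    static boolean sameProduct(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // A single trailing space is enough to make the comparison fail.
        System.out.println(sameProduct("Eggplant", MESSY_PRODUCTS.get(0)));      // false
        // Differences in letter case also make the comparison fail.
        System.out.println(sameProduct("Vegetable Oil", MESSY_PRODUCTS.get(1))); // false
    }
}
```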

4. Removing leading and trailing whitespace

Starting with whitespace, we see unnecessary spaces at the start and/or end of items in `products`. The `.trim()` method provides a simple solution: it returns a new string with the whitespace removed from both ends, leaving the spaces between words untouched.
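A minimal sketch of trimming a single product name follows. The class and the `trimName` helper are hypothetical names used only for this illustration.

```java
class TrimExample {
    // Hypothetical helper wrapping String.trim(), which strips
    // leading and trailing whitespace and returns a new string.
    static String trimName(String name) {
        return name.trim();
    }

    public static void main(String[] args) {
        String messy = "Eggplant ";
        String clean = trimName(messy);
        // Brackets make the string boundaries visible in the output.
        System.out.println("[" + messy + "]"); // prints [Eggplant ]
        System.out.println("[" + clean + "]"); // prints [Eggplant]
    }
}
```

Because Java strings are immutable, `messy` itself is unchanged; the trimmed value has to be captured in a new variable or reassigned.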

5. Standardizing text case formats

Now let's standardize letter case. In `products`, "VEGETABLE OIL" is all uppercase, while other names use different cases. `.stream()` processes each item one at a time. `.map(String::toLowerCase)` converts each item to lowercase. Finally, we print each item with `.forEach(System.out::println)`. Transforming all names to lowercase ensures that names differing only in letter case match as the same product.
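The stream described above might look like the following sketch. The sample list and the `toLowerCaseAll` helper are assumptions added for illustration; only the `.stream()`, `.map(String::toLowerCase)`, and `.forEach(System.out::println)` calls come from the narration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LowerCaseExample {
    // Hypothetical helper: returns a new list with every name lowercased.
    static List<String> toLowerCaseAll(List<String> names) {
        return names.stream()
                    .map(String::toLowerCase)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data modeled on the text.
        List<String> products = Arrays.asList("VEGETABLE OIL", "Vegetable Oil", "Eggplant");

        // Same shape as the steps in the narration: stream, map, print.
        products.stream()
                .map(String::toLowerCase)
                .forEach(System.out::println);
    }
}
```

Note that the no-argument `toLowerCase()` uses the default locale of the machine. If results need to be identical everywhere, passing an explicit locale such as `Locale.ROOT` is one option to consider.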

6. Regex patterns

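Regular expressions (regex) use special patterns to match text. Let's break down an example regex, `[^a-zA-Z\\s]`, that finds any character that isn't a letter or whitespace: square brackets [] create a character set, a caret ^ at the start of the set means "not", a-z matches lowercase letters, A-Z matches uppercase letters, and \\s matches whitespace characters such as spaces and tabs. The backslash is doubled because it must be escaped inside a Java string literal; the regex engine itself sees \s.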

7. Cleaning special characters

Let's remove special characters like parentheses and asterisks. The `.replaceAll()` method uses our example regex to match any character that isn't a letter or whitespace and replaces it with an empty string. This transforms "Fresh (Organic) *Carrots*" into "Fresh Organic Carrots", keeping only letters and spaces.
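As a sketch of this step, the snippet below applies the character-set regex with `String.replaceAll()`. The class and the `removeSpecialChars` helper are hypothetical names introduced for the example.

```java
class SpecialCharExample {
    // Hypothetical helper: deletes every character that is not
    // an ASCII letter (a-z, A-Z) or a whitespace character.
    static String removeSpecialChars(String name) {
        return name.replaceAll("[^a-zA-Z\\s]", "");
    }

    public static void main(String[] args) {
        String messy = "Fresh (Organic) *Carrots*";
        System.out.println(removeSpecialChars(messy)); // prints Fresh Organic Carrots
    }
}
```

This pattern also removes digits and accented letters, because neither falls inside `a-zA-Z`. That may or may not be desirable for a real inventory, so the character set is worth adjusting to the data at hand.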

8. Cleaning multiple spaces

Pattern matching gives us powerful tools for finding and replacing text. The `Pattern` class compiles a regex into a reusable template for text matching, and our pattern `\\s+` matches one or more consecutive whitespace characters. `pattern.matcher()` applies the `pattern` to the `messy` string. `.replaceAll()` substitutes each match with a single space. This removes the extra whitespace from "Bell Pepper Fresh". While we could use `messyProduct.replaceAll()` directly, that call recompiles the regex every time it runs; creating a `Pattern` object lets us compile once and reuse the compiled regex efficiently across multiple strings.
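Here is a minimal sketch of the compiled-pattern approach. The constant name, the `collapseSpaces` helper, and the number of extra spaces in the sample string are assumptions for illustration.

```java
import java.util.regex.Pattern;

class MultiSpaceExample {
    // Compiled once and reused for every string (hypothetical constant name).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Hypothetical helper: replaces each run of whitespace with one space.
    static String collapseSpaces(String messy) {
        return WHITESPACE_RUN.matcher(messy).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The number of extra spaces here is an assumption.
        String messy = "Bell  Pepper   Fresh";
        System.out.println(collapseSpaces(messy)); // prints Bell Pepper Fresh
    }
}
```

Collapsing does not trim: a run of leading or trailing whitespace is reduced to a single space rather than removed, which is why this step is usually paired with `.trim()`.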

9. Putting it all together

Let's apply all of our normalization operations. The `messyProducts.stream()` method processes each messy product name in sequence. For each string, we apply our cleaning steps in order: `.trim()` removes outer spaces, `.replaceAll()` matches and removes any character that isn't a letter or space, another `.replaceAll()` standardizes spaces between words, and `.toLowerCase()` converts to lowercase. Finally, `.forEach()` prints each cleaned result.
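The four steps can be chained as in the sketch below. The sample data, the class name, and the `normalize` and `normalizeAll` helpers are hypothetical; the order of operations follows the narration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class NormalizePipeline {
    // Hypothetical helper applying the cleaning steps in the order described.
    static String normalize(String name) {
        return name.trim()                          // remove outer whitespace
                   .replaceAll("[^a-zA-Z\\s]", "")  // drop non-letter, non-whitespace characters
                   .replaceAll("\\s+", " ")         // collapse runs of whitespace
                   .toLowerCase();                  // standardize letter case
    }

    // Hypothetical helper: normalizes every name in a list.
    static List<String> normalizeAll(List<String> names) {
        return names.stream()
                    .map(NormalizePipeline::normalize)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample values modeled on the text.
        List<String> messyProducts = Arrays.asList(
            "Eggplant ",
            "VEGETABLE OIL",
            "Fresh (Organic) *Carrots*",
            "Bell  Pepper   Fresh"
        );

        messyProducts.stream()
                     .map(NormalizePipeline::normalize)
                     .forEach(System.out::println);
    }
}
```

One thing to watch with this ordering: because `.trim()` runs first, removing a special character at the very end of a name can leave a stray space behind. For example, "Carrots *" becomes "carrots " with a trailing space. Moving `.trim()` to the end of the chain is one way to handle that case.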

10. Putting it all together: outputs

The output shows each product name in a consistent, clean format - perfect for reliable text matching.

11. Let's practice!

Now you can practice normalizing strings!
