Categorical standardization

1. Categorical standardization

Why do we need to standardize categories?

2. Why standardize categories?

We've extracted and modified the Category and Quantity columns from our grocery inventory dataset. However, inconsistent category names like "Fruits & Vegetables" appear with different spellings and formatting, making it hard to accurately group data. To fix this and get the correct stock totals, we need to standardize the categories.

3. Steps for standardization

The first step is to define standard categories like FRUITS_AND_VEGETABLES, DAIRY, and UNCATEGORIZED. Next, we map the expected variations of these categories to the standard categories. Then we can perform analysis, such as summing quantities per category.

4. Define standard categories with enum

First we define standard categories. An `enum` gives us a fixed set of valid categories like `FRUITS_AND_VEGETABLES`, `DAIRY`, and `UNCATEGORIZED`, preventing typos and ensuring consistency.
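As a minimal sketch, the enum could look like the following. The name `ProductCategory` and the three constants come from this lesson; the `EnumDemo` class is only a hypothetical harness for printing the values.

```java
// Fixed set of standard categories named in the lesson.
enum ProductCategory {
    FRUITS_AND_VEGETABLES,
    DAIRY,
    UNCATEGORIZED
}

// Hypothetical demo class, not part of the lesson's code.
class EnumDemo {
    public static void main(String[] args) {
        // values() returns every constant in declaration order.
        for (ProductCategory category : ProductCategory.values()) {
            System.out.println(category);
        }
        // A misspelled constant such as ProductCategory.DIARY would be
        // rejected by the compiler, which is how the enum prevents typos.
    }
}
```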

5. Map raw categories to standard categories

With our standard categories, we need to map variant forms to their standard versions. We create a `HashMap`, using `.put()` to connect each variant of "Fruits & Vegetables" to its standardized `enum` value `ProductCategory.FRUITS_AND_VEGETABLES`.
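A sketch of how such a mapping might be built is shown here. The variant spellings and the `CategoryMapping` helper class are illustrative assumptions; only `rawCategories`, `.put()`, and `ProductCategory` are taken from the lesson.

```java
import java.util.HashMap;
import java.util.Map;

enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

// Hypothetical helper class for illustration.
class CategoryMapping {
    // Builds the raw-string -> standard-category lookup table.
    static Map<String, ProductCategory> buildRawCategories() {
        Map<String, ProductCategory> rawCategories = new HashMap<>();
        // Assumed variant spellings, all pointing at the same enum constant.
        rawCategories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        rawCategories.put("fruits and vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        rawCategories.put("Fruit/Veg", ProductCategory.FRUITS_AND_VEGETABLES);
        rawCategories.put("Dairy", ProductCategory.DAIRY);
        return rawCategories;
    }

    public static void main(String[] args) {
        Map<String, ProductCategory> rawCategories = buildRawCategories();
        // Number of variant spellings registered.
        System.out.println(rawCategories.size());
    }
}
```

Note that `HashMap` keys are matched exactly, so "Dairy" and "dairy" would need separate entries unless the raw strings are normalized first.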

6. Map variations to standard categories: outputs

The `.get()` method lets us look up the standard category for any variant form.

7. Handling unknown categories

What happens when we encounter a category that isn't in our mapping? When we look up an `unknownCategory` in `rawCategories`, the `.getOrDefault()` method returns the default category `ProductCategory.UNCATEGORIZED` if it can't find `unknownCategory`.
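The difference between `.get()` and `.getOrDefault()` for a missing key can be sketched as follows. The sample entries, the "Frozen Desserts" string, and the `standardize` helper are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

// Hypothetical helper class for illustration.
class CategoryLookup {
    static final Map<String, ProductCategory> rawCategories = new HashMap<>();
    static {
        rawCategories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        rawCategories.put("Dairy", ProductCategory.DAIRY);
    }

    // Returns the standard category, or UNCATEGORIZED when the raw string is unmapped.
    static ProductCategory standardize(String raw) {
        return rawCategories.getOrDefault(raw, ProductCategory.UNCATEGORIZED);
    }

    public static void main(String[] args) {
        String unknownCategory = "Frozen Desserts"; // assumed example of an unmapped value
        System.out.println(rawCategories.get(unknownCategory)); // prints null
        System.out.println(standardize(unknownCategory));       // prints UNCATEGORIZED
        System.out.println(standardize("Dairy"));               // prints DAIRY
    }
}
```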

8. Making the mapping immutable

Once we've defined our mappings, we can prevent accidental modifications. The `Collections.unmodifiableMap()` method creates a read-only view of `rawCategories`, which we store as `categories`. This gives us a safe, public-facing version, while we can still change `rawCategories` itself, and any such change is visible through the view. Any attempt to modify `categories`, such as trying to standardize a "new" category as `ProductCategory.DAIRY`, will throw an `UnsupportedOperationException`.
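A minimal sketch of this read-only wrapper follows. The `ReadOnlyCategories` class, the `tryPut` helper, and the "Milk" entry are hypothetical and exist only to make the behavior observable.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

// Hypothetical class for illustration.
class ReadOnlyCategories {
    // Mutable backing map, kept private.
    private static final Map<String, ProductCategory> rawCategories = new HashMap<>();
    static {
        rawCategories.put("Dairy", ProductCategory.DAIRY);
    }

    // Read-only view that is safe to expose.
    static final Map<String, ProductCategory> categories =
            Collections.unmodifiableMap(rawCategories);

    // Returns true if the put succeeded, false if the view rejected it.
    static boolean tryPut(String raw, ProductCategory category) {
        try {
            categories.put(raw, category);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryPut("new", ProductCategory.DAIRY)); // prints false

        // The view reflects later changes made to the backing map.
        rawCategories.put("Milk", ProductCategory.DAIRY);
        System.out.println(categories.get("Milk")); // prints DAIRY
    }
}
```

If a snapshot that never changes is needed instead of a view, `Map.copyOf()` (Java 10+) is one alternative.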

9. Extract categories from our dataset

Let's see how standardization enables us to group categories. First, we create `rawData` using `Map.of()` to store our initial category-quantity pairs.

10. Lookup standard category

We then initialize an `EnumMap`, which is a type of map optimized for `enum` keys. Using `.forEach()`, we iterate through each raw category and quantity pair. For each pair, `categoryMap.get(raw)` converts the raw category string to its standardized value. The `.merge()` method either adds a new entry to `stockByCategory` or updates an existing one by combining quantities using `Integer::sum`.

11. Sum over standard category

After converting all variations of "Fruits & Vegetables" to the standardized category, the total quantity is 60. Without this standardization, these variant spellings would remain as separate categories with split totals.
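The grouping described above can be sketched as follows. The variant spellings and the individual quantities (10, 20, 30, and 15) are assumptions chosen so that the "Fruits & Vegetables" total comes to 60, and `StockGrouping`, `groupStock`, and `demo` are hypothetical names. The sketch uses `.getOrDefault()` rather than a plain `.get()` because `EnumMap` does not accept null keys, so an unmapped raw string would otherwise cause a `NullPointerException`.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

// Hypothetical class for illustration.
class StockGrouping {
    // Sums raw quantities under their standardized category.
    static Map<ProductCategory, Integer> groupStock(
            Map<String, Integer> rawData,
            Map<String, ProductCategory> categories) {
        Map<ProductCategory, Integer> stockByCategory = new EnumMap<>(ProductCategory.class);
        rawData.forEach((raw, quantity) -> {
            ProductCategory standard =
                    categories.getOrDefault(raw, ProductCategory.UNCATEGORIZED);
            // Insert the quantity, or add it to the existing total.
            stockByCategory.merge(standard, quantity, Integer::sum);
        });
        return stockByCategory;
    }

    static Map<ProductCategory, Integer> demo() {
        Map<String, ProductCategory> categories = new HashMap<>();
        categories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        categories.put("fruits and vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        categories.put("Fruit/Veg", ProductCategory.FRUITS_AND_VEGETABLES);
        categories.put("Dairy", ProductCategory.DAIRY);

        // Assumed sample data; Map.of requires Java 9 or later.
        Map<String, Integer> rawData = Map.of(
                "Fruits & Vegetables", 10,
                "fruits and vegetables", 20,
                "Fruit/Veg", 30,
                "Dairy", 15);
        return groupStock(rawData, categories);
    }

    public static void main(String[] args) {
        // EnumMap iterates in enum declaration order.
        System.out.println(demo()); // prints {FRUITS_AND_VEGETABLES=60, DAIRY=15}
    }
}
```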

12. Grouping by standard category: summary

To summarize, we first extracted a `rawData` map of category/quantity pairs from our dataset. Then we created `categories` to look up standard categories. Finally, we computed `stockByCategory` by looking up each raw category from `rawData` in `categories` and summing the quantities.

13. Putting it all together

We've learned to standardize categories for accurate data analysis. First, we defined valid categories using an `enum`. Then, we mapped variant forms to standard ones using a `HashMap`, protected it with `Collections.unmodifiableMap()`, handled unknowns using `.getOrDefault()`, and grouped data accurately with `.merge()`. This ensures reliable aggregation.

14. Let's practice!

Time to practice standardizing categories!
