Get startedGet started for free

Normalizing Relational Data

1. Normalizing Relational Data

Hello again. We will now immerse ourselves in data normalization. We use this approach in data modeling to ensure integrity. We can think of it as tidying up our data, ensuring everything is in the right place and easily accessible.

2. Understanding unnormalized data

As we explore the data modeling world, we encountered a silent challenge: the unnormalized data, which can also be called Unnormalized Form, commonly abbreviated as UNF. Unnormalized is when our data lacks a defined structure, or is disorganized, where there might be repetitions and anomalies.

3. Understanding unnormalized data

When examining the all products table, signs of unnormalized data become evident.

4. Understanding unnormalized data

For instance, the category attribute for Banana lists multiple values in a single row, A+ and A. This anomaly happens for many other products, making it challenging to identify individual products per category.

5. Understanding unnormalized data (3)

We can also see redundancy in the manufacturer of the product. The fruit assigned to a manufacturer has two attributes that repeat themselves: manufacturer_id and manufacturer_name. The id 104 will constantly be referencing Summer Harvest, making having both attributes in the same entity unnecessary.

6. Understanding unnormalized data (4)

Other anomalies are errors in values inserted.

7. Understanding unnormalized data (5)

Or the duplicated products, which indicates this is not a unique identifier for our entity.

8. Problems with unnormalized data

Unnormalized data presents two significant issues: redundancy and anomalies. Redundancy means unnecessary data repeats, leading to wasted storage space, complicating data updates, and causing inconsistencies. These issues can cause more problems with performance, complicating maintenance, integrity risks, interdependencies, and others.

9. Identifying unnormalized data

Let's learn how to spot unnormalized data using Snowflake. We have to look for specific patterns, like repeating values or dependencies on attributes that are not the primary key. In our entity, we can use the GROUP BY clause to aggregate data by specific columns. Then, we can apply the COUNT function to get the number the DISTINCT entries for each product. Let's name this count AS repetitions. By adding the HAVING clause, we can filter the results to show only those items with repeated values, helping us identify potential redundancies in our data.

10. Identifying unnormalized data

Let's SELECT all the DISTINCT category values. We encounter both single and combined categories. A category like 'L' is atomic, meaning it's a single, indivisible value. However, seeing 'L, A, AA' or 'B, L' indicates that products can be part of multiple categories. This complexity signals that our data model could benefit from further normalization to ensure each product is linked tom single categories.

11. Identifying unnormalized data

Let's see how this unnormalized data affects us during data retrieval. We will SELECT all products WHERE the category is 'L'. If we look at the results after filtering for category 'L', we only see the Lemon product listed.

12. Identifying unnormalized data

However, Kiwi and Raspberry should have also been in this category. This case indicates a complexity in how the data is stored and retrieved, making it vulnerable to anomalies during updates or when trying to query data.

13. Normalized data

After exploring issues related to unnormalized data, let's discuss normalized data and its benefits. Normalized data is organized into distinct entities. Each entity represents a single object or concept. It improves data accuracy and reliability by minimizing redundancy. This minimization supports consistency across the model, increasing business reliability. It enhances query performance, allowing efficient data retrieval for quicker business decisions. It also optimizes resource use by reducing duplication and saving storage, offering cost-effective solutions. Lastly, normalized data supports scalability because it can be easily adapted as a business changes and evolves.

14. Terminology and functions overview

Let's summarize the functions we have learned so far.

15. Functions overview

Take note of the implementations of the functions as well. The first query gathers unique values meeting a condition. The second query counts entries per unique column_name value, applying a condition.

16. Let's practice!

Let's practice identifying anomalies to ensure we can find to fix them!