
Data Denormalization

1. Data Denormalization

In this video, we'll describe normalization and denormalization, key mechanisms for transforming data in a Fabric lakehouse or warehouse.

2. What is Normalization?

Normalization refers to the process of structuring data in a way that reduces data redundancy and improves data integrity. Normalization is based on principles set out in 1970 by computer scientist Edgar Codd, the inventor of the relational model that forms the basis for relational databases. Codd formulated various principles of normalization. In general, we say that a model is normalized if it meets Codd's Third Normal Form, which requires that every non-key attribute depends only on the primary key.

3. What is Normalization?

In simple terms, normalization is achieved by using keys and new tables to replace attributes that otherwise would lead to data redundancy.

4. Normalization example

This is best illustrated with an example. Consider the following table storing information about video games. Each game has a unique ID and title. However, the same publisher and genre could appear multiple times in the table. Repeating the same publisher and genre values increases the size of the table and also increases the chances of integrity errors, such as the same publisher being spelled in different ways.
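The table shown on this slide is not reproduced in the transcript, so the PySpark snippet below builds a stand-in with hypothetical titles, publishers, and genres so the later examples have something concrete to work with.

from pyspark.sql import SparkSession

# spark is normally preconfigured in a Fabric notebook; getOrCreate() is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Hypothetical denormalized games table: the rows are made-up values used
# only to illustrate the redundancy problem described above.
games = spark.createDataFrame(
    [(1, "Space Raiders", "PixelSoft", "Shooter"),
     (2, "Castle Quest", "PixelSoft", "Adventure"),
     (3, "Turbo Rally", "SpeedWorks", "Racing"),
     (4, "Dungeon Delve", "PixelSoft", "Adventure")],
    ["game_id", "title", "publisher", "genre"],
)
games.show()

Note how PixelSoft and Adventure repeat across rows: that repetition is exactly the redundancy normalization removes.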

5. Normalization example

To reduce redundancy and improve integrity, we could create a separate master table for publishers and one for genres, assigning a key to each publisher and genre and using those keys to replace the corresponding entries in the games table. The resulting tables take less space because the text descriptions of publisher and genre in the games table have been replaced with integer keys, which are much smaller to store. Each publisher and genre now appears only once, in the new master tables, which improves data integrity.
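A minimal sketch of that split with PySpark, continuing from the hypothetical games DataFrame above (row_number is just one of several ways to generate surrogate keys):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Master (dimension) table with an integer key for each distinct publisher.
publishers = (games.select("publisher").distinct()
              .withColumn("publisher_id",
                          F.row_number().over(Window.orderBy("publisher"))))

# Same idea for genres.
genres = (games.select("genre").distinct()
          .withColumn("genre_id",
                      F.row_number().over(Window.orderBy("genre"))))

# Replace the text columns in the games table with the new integer keys.
games_normalized = (games.join(publishers, "publisher")
                         .join(genres, "genre")
                         .select("game_id", "title", "publisher_id", "genre_id"))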

6. Denormalization

Denormalization is the opposite of normalization. Instead of reducing redundancy, denormalization uses redundancy to flatten the data model, resulting in fewer tables.

7. When should you use Normalization?

Normalization is typically used in OLTP (online transaction processing) systems, where the focus is on optimizing writes: individual inserts, updates and deletes. For data warehouses and analytical systems, normalization is best applied to fact tables. These tables tend to be very large, so reducing redundancy on tables with millions of records leads to significant savings in storage space. Normalizing fact tables leads to the addition of dimension tables, forming the basis of the star schema.

8. When should you use Denormalization?

Compared to fact tables, dimension tables tend to be much smaller. Adding redundancy to dimension tables reduces the number of joins required, leading to faster queries. The cost of storing redundant data in dimension tables is generally outweighed by the improved query performance. Star schemas use denormalized dimensions, which are simpler to use and understand than the normalized dimensions used by snowflake schemas.

9. Implementing denormalization

There are various tools you can use to implement data denormalization. These include SQL, Spark and Dataflows.

10. Implementing denormalization with SQL

Denormalization can be implemented in SQL by joining tables with a SELECT statement. This example shows a query joining the normalized games table with the genres and publishers tables.
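The query itself isn't included in the transcript. The sketch below assumes the three normalized tables are saved as lakehouse tables named games, publishers and genres with the key columns used earlier; in a Fabric warehouse you would run the same SELECT through the SQL endpoint, while here it is submitted from a notebook with spark.sql:

# Table and column names are assumptions carried over from the earlier example.
games_denormalized = spark.sql("""
    SELECT g.game_id,
           g.title,
           p.publisher,
           ge.genre
    FROM games AS g
    JOIN publishers AS p ON g.publisher_id = p.publisher_id
    JOIN genres AS ge ON g.genre_id = ge.genre_id
""")
games_denormalized.show()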

11. Implementing denormalization with Spark

In Spark, you can load the data into Spark DataFrames and then use the join method to join DataFrames, and the select method to select the desired columns to produce a denormalized table.
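A sketch under the same assumptions (the normalized tables exist in the lakehouse with the column names used above):

# Load the normalized tables into DataFrames, join them, and keep the
# human-readable columns to produce a single flat (denormalized) table.
games_df = spark.read.table("games")
publishers_df = spark.read.table("publishers")
genres_df = spark.read.table("genres")

games_flat = (games_df
              .join(publishers_df, "publisher_id")
              .join(genres_df, "genre_id")
              .select("game_id", "title", "publisher", "genre"))
games_flat.show()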

12. Implementing denormalization with Dataflows

Finally, you can use the Merge queries or Merge queries as new transformations in a Dataflow to join queries and produce a denormalized table.

13. Let's practice!

Now, let's do a couple of exercises to put this into practice.
