1. Kimball's four step process
In the last video, we reviewed the star and snowflake schema data models. In this video, we will review Kimball's four-step process to help us design these schemas.
2. Step 1 - Select the organizational process
The first step is to select an organizational process for which we want to create a data model, such as invoice and billing, product quality monitoring, and marketing. In each of these processes, analysts and others will want to ask questions about the process and know how well they are performing.
Recall that the Kimball bottom-up approach first models one department or business process to create a data mart.
3. Step 2 - Declare the grain
In step two, we decide on the grain. The grain here means the level of data stored in the fact table. It's not required, but it advised that it is the lowest level possible. Essentially, a level where you cannot split the data any further.
For example, if we were building a warehouse for a music service, we should choose the grain level to be individual songs versus a full album from a band. If we were tracking shipping data, we should store the individual line items on a shipment versus the overall shipment. This will allow organizational users to analyze the data to ask questions like "Which song is the most popular?" or "Which products were delayed the most in shipping?". Organizational users may be unable to answer the questions they have if we choose the wrong grain here, making the info and data warehouse less valuable or useless to them in extreme circumstances.
4. Step 3 - Identify the dimensions
In this step three, we want to choose the dimensions that apply to each fact table row. Overall, we want to add a rich set of dimensions representing as many characteristics as possible of the fact table data. A data model may have a unique set of dimensions, but there are some that are typically used. For example, tracking the year, month, and the quarter is typical for any process that involves time. Likewise, tracking the address, state, and country is common for any process involving location. Also, it is standard to include the dimensions of name and email address for any process involving users. Answering the question, "How do organizational users describe the data that results from the business process?" helps in this step.
Also, it would be valuable for us to have feedback from analysts and other users who often work with the data.
5. Step 4 - Identify the facts
In step four, we want to identify the numeric facts that will populate each fact table row. Examples of possible facts for a fact table for a music service include the total number of plays or sales revenue for a song. However, if we were creating a fact table for a ride-sharing service, examples of possible facts are the distance traveled or the time needed. These facts capture a measurement or metric of the process.
Having members of the organization answer the question "What are we answering?" will help us identify numerical facts to include. Recall the purpose of the data warehouse is for reporting and analysis of the organizational process. We will need to decide on the vital numerical measures to store to answer likely questions about the process later. Finally, the fact or metric chosen should be valid at the grain selected in step two.
6. Summary
The steps here help us design a data model for our star or snowflake schemas. It is vital to work with organizational users to gain feedback that will make the model relevant for them.
7. Let's practice!
Okay, practice time!