Data cleaning
1. Data cleaning
This video will review some common transformation steps used to clean data during the ETL or ELT process.
2. Video agenda
In this video, we will shed more light on each item in this list. This is a partial list. When implementing the data transformation process, the implementation team will evaluate the organizational requirements and use a combination of these four processes, and others, to meet them.
3. Data format cleaning
First, we will review data format cleaning. The data warehouse receives data in all types of formats. For example, values of dates or names might be abbreviated, have different capitalization, or be formatted differently in the input sources. However, at the end of the transformation phase, we want clean and consistent data for the data warehouse. Therefore, it is essential to use organizational rules to format the data so that the output is consistent. Let's look at some taxi cab user data to add more context. In it, we have data from two sources that we want to join together. Unfortunately, within the first table, the capitalization of the users' names is inconsistent. Also, the date format in the other table is not what we want. So we cleanse the data by adjusting both formats in the final joined table.
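As a rough sketch of what this step could look like in code, here is a small pandas example. The table and column names (users, rides, name, ride_date) and the sample values are invented for illustration, not taken from the video.

    import pandas as pd

    # Hypothetical taxi user data from two sources with inconsistent formats.
    users = pd.DataFrame({
        "user_id": [1, 2, 3],
        "name": ["ANA LIMA", "bo chen", "Cy Drew"],  # mixed capitalization
    })
    rides = pd.DataFrame({
        "user_id": [1, 2, 3],
        "ride_date": ["03/01/2024", "2024-03-02", "March 3, 2024"],  # mixed date formats
    })

    # Apply organizational formatting rules: title-case the names and
    # parse every date into one consistent date type.
    users["name"] = users["name"].str.title()
    rides["ride_date"] = pd.to_datetime(rides["ride_date"], format="mixed").dt.date  # format="mixed" needs pandas 2.x

    # Join the cleansed tables.
    clean = users.merge(rides, on="user_id")
    print(clean)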
4. Address parsing
Parsing addresses is another typical cleansing operation. This process takes an address and divides it into individual components, like city and state. This helps standardize the data. Additionally, many ETL tools can automatically perform address lookups and validate the address.
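To make the idea concrete, here is one way a simple address split might be sketched in pandas. The addresses and column names are made up for the example, and real ETL tools layer automatic lookups and validation on top of this kind of parsing.

    import pandas as pd

    # Hypothetical US-style addresses stored in a single column.
    addresses = pd.DataFrame({
        "address": ["123 Main St, Springfield, IL 62704",
                    "456 Oak Ave, Madison, WI 53703"],
    })

    # Divide each address into its individual components.
    parts = addresses["address"].str.split(", ", expand=True)
    addresses["street"] = parts[0]
    addresses["city"] = parts[1]
    addresses[["state", "zip"]] = parts[2].str.split(" ", expand=True)
    print(addresses)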
5. Data validation
Next, let's discuss data validation. Two common validation checks are range checks and type checks. First, a range check verifies that data values fall within the expected range. For example, you would not expect to see a person's age entered as 300 years old; that value would fail the range check. A type check looks at the data type of a value. For example, imagine the data from our source system returned a data type of string for customers' ages, when we would normally expect an integer. The type check would flag those values as invalid, and we could convert them to the expected data type. It is best to fix these issues in the input source if possible; if not, an organizational rule might set a default value or exclude rows that fail validation checks.
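As an illustrative sketch, the range and type checks described here might look like this in pandas. The customers table, the 0-to-120 age range, and the rule of excluding failing rows are assumptions made for the example.

    import pandas as pd

    # Hypothetical customer data where age arrives as a string from the source system.
    customers = pd.DataFrame({
        "customer_id": [10, 11, 12],
        "age": ["34", "300", "twenty"],
    })

    # Type check: convert to a numeric type; values that cannot be parsed become NaN.
    customers["age"] = pd.to_numeric(customers["age"], errors="coerce")

    # Range check: flag ages outside an organizationally defined valid range.
    customers["age_valid"] = customers["age"].between(0, 120)

    # One possible organizational rule: exclude rows that fail either check.
    valid_customers = customers[customers["age"].notna() & customers["age_valid"]]
    print(valid_customers)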
6. Duplicate row elimination
Finally, the deduplication step removes repeated rows of data. In this example of hospital doctor data, we have two tables. We want to append the rows of the table on the left to the bottom of the table on the right. However, we do not want duplicate rows. Deduplication will flag that doctor ID 275 is a duplicate and will only append rows 300 and 310.
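Here is a minimal sketch of this deduplication step in pandas, assuming the doctor tables are keyed by a doctor_id column; the doctor names are invented, but the IDs follow the example in the video.

    import pandas as pd

    # Existing hospital doctor table and the incoming rows to append.
    existing = pd.DataFrame({"doctor_id": [100, 200, 275],
                             "name": ["Dr. Adams", "Dr. Baker", "Dr. Chen"]})
    incoming = pd.DataFrame({"doctor_id": [275, 300, 310],
                             "name": ["Dr. Chen", "Dr. Diaz", "Dr. Evans"]})

    # Append the incoming rows, then drop duplicates so doctor 275 appears
    # only once and just rows 300 and 310 are effectively added.
    combined = (pd.concat([existing, incoming], ignore_index=True)
                  .drop_duplicates(subset="doctor_id", keep="first"))
    print(combined)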
7. Data governance
In summary, we reviewed four different data-cleaning processes that can be done in the ETL or ELT step. However, an organization with a strong data governance program should reduce the need for some of these cleaning processes. Data governance develops rules and definitions around the data and detects and corrects data that deviates from those definitions. Therefore, by the time the data gets to the ETL or ELT process, less cleaning needs to be done.
8. Let's practice!
Okay, let's review!