1. Data quality and ingestion
Nice work on those exercises. Let's now turn to data quality and ingestion.
2. Data quality and ingestion
Gathering data is part of the design phase. During this phase, we investigate the quality of the data and how to extract the required data.
3. What is data quality?
In machine learning, data quality is a measure of how well data serves its intended purpose. It is typically evaluated along several dimensions.
The quality of a machine learning model is highly dependent on the quality of the data.
4. Data quality dimensions
We focus on some of the important dimensions along which data quality is defined, namely accuracy, completeness, consistency, and timeliness.
5. Data quality dimensions example
Let's say that we are building a predictive model for determining whether customers will churn. We can check the quality of customer data by going over the four dimensions.
An example of accuracy would be whether the data correctly describes the customer. It could be that the data states that a customer is 18, but the customer is actually 32. That would be inaccurate.
For completeness, we mainly look at missing data, for instance, whether we are missing last names of customers.
With consistency, we investigate whether the definition of a customer is consistent throughout the organization. It could be that one department has a different definition of an active customer than another, which makes the data inconsistent.
With timeliness, we are interested in how up to date the available data is. For instance, if customer orders are only synchronized once a day, they are not available in real time.
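To make this concrete, here is a minimal sketch of how the accuracy and completeness dimensions could be checked in code. It assumes a hypothetical pandas DataFrame called customers with customer_id, last_name, and age columns; the column names, values, and the plausible age range are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical customer data; the column names and values are assumptions
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_name": ["Smith", None, "Garcia"],
    "age": [32, 17, 251],
})

# Accuracy: flag ages that fall outside a plausible range
implausible_age = customers[(customers["age"] < 0) | (customers["age"] > 120)]
print(f"Rows with implausible ages: {len(implausible_age)}")

# Completeness: count how many last names are missing
missing_last_names = customers["last_name"].isna().sum()
print(f"Missing last names: {missing_last_names}")
```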
Lower data quality in one or more dimensions does not mean that the project is bound to fail. There are several things we can do to tackle low data quality, such as collecting more data or filling in missing data from other sources.
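As a hedged example of that second option, the sketch below fills in the missing last names from a second, hypothetical source such as a CRM export; the crm_export table and the customer_id key are assumptions made for illustration, not part of any particular system.

```python
import pandas as pd

# Customers with a missing last name, continuing the hypothetical example
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_name": ["Smith", None, "Garcia"],
})

# Hypothetical secondary source, for example a CRM export
crm_export = pd.DataFrame({
    "customer_id": [2],
    "last_name": ["Nguyen"],
})

# Merge the secondary source in and use it to fill the gaps
customers = customers.merge(crm_export, on="customer_id",
                            how="left", suffixes=("", "_crm"))
customers["last_name"] = customers["last_name"].fillna(customers["last_name_crm"])
customers = customers.drop(columns="last_name_crm")
print(customers)
```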
6. Data ingestion
Typically, in the design phase, we also look into how to extract and process the data. This is done with an automated data pipeline, of which we see a high-level example here. A data pipeline is often part of the machine learning lifecycle and is the route through which data is automatically processed.
A common type of data ingestion process is ETL, which stands for extract, transform, and load. These are the three steps an ETL pipeline goes through: the data is extracted from the source, transformed into the required format, and loaded into an internal or proprietary database.
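As an illustration, here is a minimal ETL sketch in Python; the orders.csv file, its column names, and the SQLite database are assumptions made for this example rather than part of any specific pipeline.

```python
import sqlite3

import pandas as pd

# Extract: read the raw data from the source (a hypothetical orders.csv)
raw = pd.read_csv("orders.csv")

# Transform: bring the data into the required format, for example by
# parsing dates and computing a total amount per order
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["total"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed data into an internal database
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("orders", conn, if_exists="replace", index=False)
```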
In an ETL pipeline, we can also include automated checks, such as expectations we have about certain data columns. For instance, we might expect that a temperature column always contains a number. Including these automated checks helps speed up the development and deployment phases of the lifecycle, since faulty or low-quality data would otherwise slip through and negatively affect the machine learning model.
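One simple way to express such an expectation, sketched here with plain pandas rather than a dedicated validation library, is to assert the property before the data moves further down the pipeline; the readings DataFrame and its values are made up for this example.

```python
import pandas as pd

# Hypothetical sensor readings arriving in the pipeline
readings = pd.DataFrame({"temperature": [21.5, 19.8, "error", 22.1]})

# Expectation: the temperature column should always contain a number.
# Coerce to numeric and flag any value that fails to parse.
as_numbers = pd.to_numeric(readings["temperature"], errors="coerce")
bad_rows = readings[as_numbers.isna()]

if not bad_rows.empty:
    raise ValueError(f"{len(bad_rows)} temperature value(s) are not numeric")
```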
7. Let's practice!
Now that we've looked into data quality and ingestion, let's dive into some exercises!