
Data ingestion

1. Data ingestion

Now, let's discuss the data and model pipeline components!

2. ETL process

We'll start with the data ingestion process.

3. ETL process - source

Data ingestion, or ETL for short, is the process of

4. ETL process - extract

extracting data from the API,

5. ETL process - transform

normalizing it, and

6. ETL process - load

loading it to the database. This enables us to use it with minimal transformation when creating the forecast.
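As a rough illustration of these three steps, here is a minimal sketch assuming a hypothetical REST endpoint, pandas for normalization, and a SQLite database; the URL, JSON layout, column names, and table name are placeholders rather than the course's actual settings:

```python
import sqlite3

import pandas as pd
import requests


def extract(api_url: str, params: dict) -> dict:
    """Extract: pull the raw series from the API (hypothetical endpoint and parameters)."""
    response = requests.get(api_url, params=params)
    response.raise_for_status()
    return response.json()


def transform(payload: dict) -> pd.DataFrame:
    """Transform: normalize the raw JSON into a tidy table (illustrative JSON path and columns)."""
    data = pd.DataFrame(payload["response"]["data"])
    data["period"] = pd.to_datetime(data["period"])
    data["value"] = pd.to_numeric(data["value"])
    return data.sort_values("period").reset_index(drop=True)


def load(data: pd.DataFrame, db_path: str, table: str = "normalized_series") -> None:
    """Load: append the normalized rows to the database table."""
    with sqlite3.connect(db_path) as con:
        data.to_sql(table, con, if_exists="append", index=False)
```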

7. ETL process - data validation

This process includes data validation steps between the components

8. ETL process - data integrity

to prevent data integrity issues.

9. ETL process - logging

We log results throughout this process. These logs are critical for monitoring pipeline health and identifying issues as they occur.
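One way to capture these logs is a combination of Python's standard `logging` module and a log table with one row per run; the table name and fields below are assumptions for illustration:

```python
import logging
from datetime import datetime, timezone

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("etl")


def log_run(db_con, success: bool, last_timestamp: str, comments: str = "") -> None:
    """Append one row per pipeline run to a hypothetical etl_log table."""
    row = pd.DataFrame([{
        "run_time": datetime.now(timezone.utc).isoformat(),
        "success": success,
        "last_timestamp": last_timestamp,  # last timestamp loaded in this run
        "comments": comments,
    }])
    row.to_sql("etl_log", db_con, if_exists="append", index=False)
    logger.info("Run logged: success=%s, last_timestamp=%s", success, last_timestamp)
```

The `last_timestamp` column is what the parity check later compares against the latest timestamp on the API.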

10. ETL process - refreshing

To maintain parity between the data on the API and the normalized table, we will

11. ETL process - refreshing

use a function that

12. ETL process - refreshing

compares the last timestamp available on the API with the one we captured in the log table.

13. ETL process - refreshing

If new data is available, the function

14. ETL process - triggering a new pipeline

will trigger the refresh process,

15. ETL process - updating normalized data

and bring the normalized table back to parity with the data on the API. Let's review the implementation of those steps in the DAG design.

16. ETL process - data parity check

Starting with the data parity check, we send a GET request to the API to pull the series metadata.

17. ETL process - metadata

We will extract the series' last timestamp from the metadata and compare it with the normalized series' metadata to determine whether the tables are at parity.
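A possible shape for this parity check, reusing the `etl_log` table from the logging sketch; the metadata URL and the JSON field holding the last timestamp are assumptions:

```python
import sqlite3

import pandas as pd
import requests


def api_last_timestamp(metadata_url: str, params: dict) -> pd.Timestamp:
    """Send a GET request for the series metadata and extract its last available timestamp."""
    meta = requests.get(metadata_url, params=params).json()
    return pd.Timestamp(meta["response"]["endPeriod"])  # hypothetical JSON path


def local_last_timestamp(db_path: str) -> pd.Timestamp:
    """Read the last timestamp we captured in the log table for a successful run."""
    query = "SELECT MAX(last_timestamp) AS last_timestamp FROM etl_log WHERE success = 1"
    with sqlite3.connect(db_path) as con:
        logged = pd.read_sql(query, con)
    return pd.Timestamp(logged.loc[0, "last_timestamp"])


def new_data_available(metadata_url: str, params: dict, db_path: str) -> bool:
    """Parity check: True when the API holds data newer than the normalized table."""
    return api_last_timestamp(metadata_url, params) > local_last_timestamp(db_path)
```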

18. ETL process - data refresh

If new data is available, it will trigger the data refresh process to pull and normalize the incremental data. If the data refresh completes successfully,

19. ETL process - data validation

we will conduct a set of data validation tests,

20. ETL process - data analysis

analyze the validation results, and, if the checks pass,

21. ETL process - append the data

append the incremental data to the normalized table. Once the data refresh is completed, we will trigger the forecast refresh process.
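Sticking with the earlier sketches, the refresh branch could be wrapped in a single function that reports its outcome for the DAG to branch on; `extract` and `transform` are the sketches above, and `validate_increment` is the validation helper shown later in the `pointblank` example:

```python
import sqlite3


def refresh_data(api_url: str, params: dict, db_path: str) -> str:
    """Pull, validate, and append the incremental data, returning a status for the branching logic."""
    try:
        increment = transform(extract(api_url, params))  # incremental pull and normalization
    except Exception:
        return "refresh failed"

    if not validate_increment(increment):  # data validation tests (see the pointblank example)
        return "validation failed"

    with sqlite3.connect(db_path) as con:  # append the new rows to the normalized table
        increment.to_sql("normalized_series", con, if_exists="append", index=False)
    return "success"
```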

22. ETL process - forecast refresh

This includes creating a new forecast if the refresh condition is met and appending it to the forecast table.
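A matching sketch for the forecast side, where `make_forecast` is a hypothetical stand-in for whichever model the course uses and the table names remain placeholders:

```python
import sqlite3

import pandas as pd


def refresh_forecast(db_path: str, refresh_needed: bool) -> None:
    """Create a new forecast when the refresh condition is met and append it to the forecast table."""
    if not refresh_needed:
        return
    with sqlite3.connect(db_path) as con:
        history = pd.read_sql(
            "SELECT period, value FROM normalized_series", con, parse_dates=["period"]
        )
        forecast = make_forecast(history)  # hypothetical model call producing a DataFrame
        forecast.to_sql("forecast", con, if_exists="append", index=False)
```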

23. ETL process - forecast evaluation

The last step is to score previous forecasts and log the data. Note that the DAG contains three branches (sketched in code below) that enable us to terminate the pipeline run if:

24. ETL process - no update

no new data is available

25. ETL process - refresh failure

the refresh process failed, or

26. ETL process - validation failure

the data validation failed. Data validation checks are critical for ensuring data quality.
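Before looking at validation in detail, here is how the three termination branches might be wired together, reusing the helpers from the earlier sketches (a rough outline, not the course's actual DAG implementation):

```python
def run_pipeline(metadata_url: str, api_url: str, params: dict, db_path: str) -> str:
    """Run one pipeline cycle, terminating early on any of the three branch conditions."""
    if not new_data_available(metadata_url, params, db_path):
        return "no update"  # branch 1: nothing new on the API

    status = refresh_data(api_url, params, db_path)
    if status != "success":
        return status  # branch 2 or 3: the refresh or the data validation failed

    refresh_forecast(db_path, refresh_needed=True)  # happy path: refresh the forecast as well
    return "success"
```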

27. Data validation checks

A data validation process is built on our knowledge and expectations of the data. Typically, we validate the data schema, value ranges, missing values, duplications, and business logic. Python offers various frameworks for data validation, such as `great_expectations`, `ydata_profiling`, and `pointblank`.

28. Data validation checks

In the following example, we will use `pointblank` to demonstrate a validation process on data we pulled from the API:

29. Data validation checks

We start by importing `pointblank` as `pb`. Next, we define the expected schema structure of our data using the `pb.Schema()` function, providing the column names and attributes.

30. Data validation checks

We use the `Validate()` function to define our validation object and chain the following validation checks: schema validation using the schema object we defined earlier, a check that the series values aren't negative, checks on the values of the respondent and type columns, a check that the series index and value columns do not contain nulls, and a check for duplicate rows. Finally, we execute the validation.
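The transcript describes this code rather than showing it, so here is a hedged reconstruction; the column names (`period`, `respondent`, `type`, `value`), dtypes, the allowed value sets (interpreted here as set-membership checks), and the `increment` DataFrame are assumptions standing in for the data actually pulled from the API:

```python
import pointblank as pb

# Expected schema of the pulled data (illustrative column names and dtypes)
schema = pb.Schema(
    columns=[
        ("period", "datetime64[ns]"),
        ("respondent", "object"),
        ("type", "object"),
        ("value", "float64"),
    ]
)

validation = (
    pb.Validate(data=increment, label="Incremental data checks")
    .col_schema_match(schema=schema)                       # schema validation
    .col_vals_ge(columns="value", value=0)                 # series values aren't negative
    .col_vals_in_set(columns="respondent", set=["US48"])   # expected respondent codes (placeholder)
    .col_vals_in_set(columns="type", set=["D"])            # expected type codes (placeholder)
    .col_vals_not_null(columns=["period", "value"])        # index and value columns contain no nulls
    .rows_distinct()                                       # no duplicate rows
    .interrogate()                                         # execute the validation
)
```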

31. Data validation checks

The validate function returns a nice summary table with the validation results.

32. Data validation checks

In deployment, we will use the `all_passed` method, which returns True if all validation steps passed and False otherwise.
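In the pipeline code, that boolean can be wrapped in the small helper the refresh sketch referred to earlier; `schema` is the schema object defined above:

```python
def validate_increment(increment) -> bool:
    """Run the checks above and return True only if every validation step passed."""
    validation = (
        pb.Validate(data=increment)
        .col_schema_match(schema=schema)
        .col_vals_ge(columns="value", value=0)
        .col_vals_not_null(columns=["period", "value"])
        .rows_distinct()
        .interrogate()
    )
    return validation.all_passed()  # False terminates the pipeline run before the append step
```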

33. Let's practice!

Time to validate your data!
