Data ingestion
Now, let's discuss the data and model pipeline components! We'll start with the data ingestion process.
Data ingestion, or ETL for short, is the process of extracting data from the API, normalizing it, and loading it into the database. This enables us to use the data with minimal transformation when creating the forecast.
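The extract-transform-load flow described above can be sketched in a few lines. This is a minimal illustration, not the course's actual implementation: the payload shape, field names, and `series` table are all hypothetical, and a real pipeline would pull the payload from the API rather than define it inline.

```python
import sqlite3

# Hypothetical raw payload, shaped like a JSON response from an API.
# The field names here are illustrative only.
raw = {
    "observations": [
        {"period": "2024-01-01", "value": "101.5"},
        {"period": "2024-01-02", "value": "103.2"},
    ]
}

def transform(payload):
    """Normalize the raw API records: parse values, skip malformed rows."""
    rows = []
    for obs in payload["observations"]:
        try:
            rows.append((obs["period"], float(obs["value"])))
        except (KeyError, ValueError):
            continue  # drop records that fail basic parsing
    return rows

def load(rows, conn):
    """Append the normalized rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS series (period TEXT, value REAL)")
    conn.executemany("INSERT INTO series VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(raw), conn)
```

Keeping the normalized table close to the shape the forecast step needs is what lets later stages run with minimal transformation.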
This process includes data validation steps between components to prevent data integrity issues. We log results throughout the process; these logs are critical for monitoring pipeline health and identifying issues as they occur.
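A minimal sketch of step-level logging, assuming Python's standard `logging` module; the logger name, message fields, and `log_run` helper are illustrative, not the course's actual logging schema.

```python
import logging

# Illustrative logger for the pipeline; the name is an assumption.
logger = logging.getLogger("etl_pipeline")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

def log_run(step, success, details=""):
    """Record the outcome of a pipeline step so failures are visible
    when monitoring the pipeline's health."""
    if success:
        logger.info("step=%s status=success %s", step, details)
    else:
        logger.error("step=%s status=failed %s", step, details)

log_run("data_refresh", success=True, details="rows_appended=24")
```

Writing one structured line per step makes it easy to later query the log for the last successful run of each component.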
To maintain parity between the data on the API and the normalized table, we use a function that compares the last timestamp available on the API with the one we captured in the log table. If new data is available, the function triggers the refresh process and brings the normalized table back to parity with the data on the API. Let's review the implementation of these steps in the DAG design.
Starting with the data parity check, we send a GET request to the API to pull the series metadata. We then extract the series' last timestamp from the metadata and compare it with the normalized series' metadata to determine whether the tables are at parity.
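The parity check described above amounts to a timestamp comparison. Here is a minimal sketch; the `last_updated` metadata field is an assumed name, and a real check would fetch the metadata with a GET request rather than receive it as a dictionary.

```python
from datetime import datetime

def updates_available(api_metadata, log_last_timestamp):
    """Compare the series' last timestamp on the API with the last
    timestamp recorded in the log table.

    'last_updated' is an illustrative field name; the real API's
    metadata schema may differ.
    """
    api_last = datetime.fromisoformat(api_metadata["last_updated"])
    return api_last > log_last_timestamp

meta = {"last_updated": "2024-06-02T00:00:00"}
print(updates_available(meta, datetime(2024, 6, 1)))  # True -> trigger refresh
```

When the function returns `True`, the DAG proceeds to the data refresh step; otherwise the run terminates early.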
If new data is available, this triggers the data refresh process to pull and normalize the incremental data. If the data refresh completes successfully, we conduct a set of data validation tests, analyze the validation results, and, if they pass, append the incremental data to the normalized table. Once the data refresh is complete, we trigger the forecast refresh process.
This includes creating a new forecast if the refresh condition is met and appending it to the forecast table. The last step is to score previous forecasts and log the results. Note that the DAG contains three branches that enable us to terminate the pipeline run if:
no new data is available, the refresh process failed, or the data validation failed. Data validation checks are critical for ensuring data quality.
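The three early-termination branches can be sketched as plain control flow. This is a simplified stand-in for the DAG's branching, not the actual orchestration code; the callables represent the real pipeline tasks.

```python
def run_pipeline(new_data_available, refresh, validate, append):
    """Sketch of the DAG's branching logic: each branch can end the
    run early before the incremental data is appended."""
    if not new_data_available:
        return "skipped: no new data"
    if not refresh():
        return "stopped: refresh failed"
    if not validate():
        return "stopped: validation failed"
    append()
    return "completed"

# Example run where every task succeeds:
result = run_pipeline(True, lambda: True, lambda: True, lambda: None)
```

In an orchestrator such as Airflow, each `if` would correspond to a branching task that routes the run to either the next step or a terminal node.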
Creating a data validation process is based on our knowledge and expectations of the data. Typically, we validate the data schema, value ranges, missing values, duplications, and business logic. Python offers various frameworks for data validation, such as `great_expectations`, `ydata_profiling`, and `pointblank`. In the following example, we will use `pointblank` to demonstrate a validation process on data we pulled from the API.
We start by importing `pointblank` as `pb`. Next, we define the expected schema of our data using the `pb.Schema()` function, providing the column names and attributes. We then use the `Validate()` function to define our validation object and chain the following checks: schema validation using the schema object we defined earlier; the series values aren't negative; the respondents and type columns contain the expected values; the series index and value columns do not contain nulls; and there are no duplicate rows. Finally, we execute the validation.
The validation returns a nice summary table with the results. In deployment, we will use the `all_passed()` method, which returns `True` if all checks passed and `False` otherwise.
Time to validate your data!