
Data validation

1. Data validation

Data validation is done during a data audit. Let's have a closer look at it.

2. What we will cover

We'll discuss what data validation is and how it relates to the dimensions of responsible data management, and we'll apply it to our fictional AI financial advisor. In the next video, we will look at common validation steps in preprocessing, modeling, and post-deployment.

3. Data validation

Data auditing is a broader review of data management within a project. Data validation takes a more granular approach to check and evaluate the data's technical integrity and fairness.

4. Technical integrity

We start by checking the technical integrity of the data. We verify that data quality holds by checking for completeness and looking for duplicates, errors, and anything outdated or incorrect. Maintaining this quality is usually necessary to remain legally compliant. Checks for accuracy, consistency, completeness, and timeliness keep us aligned with the responsible data management practices we've learned about.
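
To make these checks concrete, here is a minimal pandas sketch; the transactions table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical transactions table for illustration.
df = pd.DataFrame({
    "transaction_id": [101, 102, 102, 104],
    "amount": [250.0, None, -40.0, 99.9],
    "date": pd.to_datetime(["2024-01-05", "2024-01-07",
                            "2024-01-07", "2019-03-02"]),
})

# Completeness: count missing values per column.
print(df.isna().sum())

# Duplicates: repeated transaction IDs suggest double ingestion.
print(df[df.duplicated(subset="transaction_id", keep=False)])

# Timeliness: records older than a chosen cutoff may be outdated.
print(df[df["date"] < pd.Timestamp("2021-01-01")])
```

In practice, each check would feed a report or raise an alert rather than print, but the idea is the same: completeness, duplication, and timeliness are all simple, automatable tests.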

5. Financial advisor: technical integrity

Let's go back to our financial AI app. Perhaps the user uploaded some documents that included discrepancies and anomalies. They could have uploaded statements from different banks in different formats. This could lead to some fields being incorrectly assigned as income or expenses, leading to incorrect or biased results. Through technical data validation checks, we could spot and correct this early.
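
One way such a check might look: map each bank's column names onto a single canonical schema and flag rows where a field could have been misassigned. The bank formats and column names below are assumptions for illustration, not a real ingestion pipeline.

```python
import pandas as pd

# Hypothetical column mappings for two bank-statement formats.
SCHEMAS = {
    "bank_a": {"Credit": "income", "Debit": "expense"},
    "bank_b": {"Paid In": "income", "Paid Out": "expense"},
}

def normalize(statement: pd.DataFrame, bank: str) -> pd.DataFrame:
    """Map bank-specific columns onto one canonical schema."""
    df = statement.rename(columns=SCHEMAS[bank])
    # A row carrying both income and expense often signals a field
    # misassigned during ingestion; flag it for review.
    suspect = df["income"].notna() & df["expense"].notna()
    if suspect.any():
        print(f"{suspect.sum()} rows need manual review")
    return df

raw = pd.DataFrame({"Paid In": [1200.0, None], "Paid Out": [80.0, 60.0]})
clean = normalize(raw, "bank_b")
print(clean)
```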

6. Financial advisor: technical integrity

Differences in data collection periods and methods between datasets are common. The historical data used in the app may not accurately account for regional economic disparities: regions with lower economic stability tend to have lower data accuracy. Left uncorrected, the app might develop investment strategies that are not optimized for all user groups, unfairly disadvantaging users in less economically stable regions.

7. Fairness assessment

Data validation is also where we can apply metrics that test for responsible and ethical AI. Recall that we can use "equality of opportunity," "disparate impact," or "demographic parity" approaches. These metrics can be applied at all project stages, including data preprocessing, algorithm design, and model evaluation. In the financial AI app, if "fairness" means equality of opportunity, all individuals, regardless of protected group status, should receive financial advice that is equally relevant and beneficial. If our fairness metric is demographic parity, we aim for a uniform distribution of financial product recommendations across all demographic groups. Lastly, suppose we adopt the disparate impact approach. In that case, the app should minimize unintended biases and ensure that no group is disproportionately affected by the financial strategies it generates.
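
As an illustration, here is a small sketch of a demographic parity check and a disparate impact ratio computed with pandas; the evaluation table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical evaluation output: one row per user, with a protected
# attribute and whether a product was recommended.
results = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b"],
    "recommended": [1, 1, 0, 1, 0, 0, 0],
})

# Demographic parity: recommendation rates should be similar
# across demographic groups.
rates = results.groupby("group")["recommended"].mean()
print(rates)

# Disparate impact ratio: minimum rate over maximum rate; the
# common "four-fifths rule" flags values below 0.8.
print(f"disparate impact ratio: {rates.min() / rates.max():.2f}")
```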

8. Data validation approaches

Particular approaches to validating data depend on the project and its context, but the most common are: identifying key variables that could introduce bias or affect representation, such as age, gender, and income level; analyzing the data distribution to spot imbalances; cleaning the data and applying statistical tests to assess fairness and bias; oversampling or undersampling to balance the dataset; and checking fairness metrics during model evaluation while testing models with diverse datasets. A sketch of the distribution and balancing steps follows.
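
A minimal sketch of the distribution analysis and a naive random-oversampling step; the income_level variable and its counts are invented for illustration, and a dedicated library such as imbalanced-learn offers more robust resampling.

```python
import pandas as pd

# Hypothetical training data with an income_level variable.
df = pd.DataFrame({
    "income_level": ["low"] * 20 + ["mid"] * 50 + ["high"] * 30,
    "label": [0, 1] * 50,
})

# Analyze the distribution to spot imbalances.
counts = df["income_level"].value_counts(normalize=True)
print(counts)

# Naive random oversampling of the under-represented group,
# duplicating rows until it matches the largest group.
target = int(counts.max() * len(df))
low = df[df["income_level"] == "low"]
extra = low.sample(target - len(low), replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced["income_level"].value_counts())
```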

9. Financial advisor

In our financial advisor project, we preprocess the data to remove outliers and impute missing values. To validate the data, we compare descriptive statistics before and after preprocessing, oversample low-income groups, and check model performance and fairness metrics. We use cross-validation, stratified by protected group, to evaluate whether the model's performance remains consistent across demographic groups after preprocessing. Finally, we apply our selected fairness metrics and assess the model on unseen data.
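
A sketch of the stratified check with scikit-learn, assuming synthetic features, labels, and a binary protected-group indicator; in the real project these would come from the preprocessed dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins: features X, labels y, and a protected-group
# indicator (e.g., low- vs higher-income) used for stratification.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)
group = rng.integers(0, 2, size=200)

# Stratify folds on the protected group so each fold mirrors the
# overall group balance, then compare per-group performance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train, test) in enumerate(cv.split(X, group)):
    model = LogisticRegression().fit(X[train], y[train])
    preds = model.predict(X[test])
    for g in (0, 1):
        mask = group[test] == g
        acc = accuracy_score(y[test][mask], preds[mask])
        print(f"fold {fold}, group {g}: accuracy {acc:.2f}")
```

If per-group accuracy diverges noticeably across folds, that is a signal to revisit the preprocessing or resampling choices before deployment.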

10. Let's practice!

Now, let's practice!