Data validation best practices

1. Data validation best practices

Let's take a deeper dive into data validation.

2. What we will cover

Most data transformations happen during the preprocessing stages of a project, so let's start there. The most common validation strategies for structured data involve subgroup analysis, handling missing values, outlier removal, and correcting data inconsistencies. We will also cover feature scaling, encoding, and dimensionality reduction.

3. Subgroup analysis

The most common way to validate preprocessing steps is to perform subgroup analysis. To do this, we divide the dataset into subgroups based on protected characteristics. We evaluate each subgroup's statistical distribution and model performance separately and compare the results. We also compute model fairness metrics for each group and evaluate them according to the selected fairness approach, such as equal opportunity, demographic parity, or disparate impact. If the model's predictions are consistent and fair across all groups, the validation is complete and we proceed to the next preprocessing step. If not, we investigate and apply mitigation strategies.
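
Here is a minimal sketch of such a check in Python, assuming a pandas DataFrame that already contains a protected attribute, true labels, and model predictions; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical example data: true labels, model predictions, and a protected
# attribute; in practice these come from your own dataset and model.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A"],
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
})

rows = []
for name, sub in df.groupby("group"):
    rows.append({
        "group": name,
        "size": len(sub),
        # True positive rate per group; equal opportunity compares these values
        "tpr": recall_score(sub["y_true"], sub["y_pred"]),
        # Share of positive predictions; demographic parity compares these
        "positive_rate": sub["y_pred"].mean(),
    })

report = pd.DataFrame(rows)
print(report)
print("Equal opportunity gap:", report["tpr"].max() - report["tpr"].min())
```

A large gap between the best and worst subgroup would signal that the preprocessing step needs investigation before moving on.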

4. Missing data

Missing data is common in large datasets, and strategies to address it range from deleting the affected rows to imputation, using mean values for numerical features or the most frequent value for categorical ones. Alternative methods, such as k-nearest neighbors (KNN) or regression, are model-based. However, imputation can introduce or amplify bias, so subgroup analysis is used to check whether imputed values unfairly impact minority groups.
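
A short sketch of both imputation strategies with a per-group check might look like the following; the dataset, column names, and the choice of two neighbors are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing income values
df = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B", "A"],
    "income": [40000, np.nan, 85000, 90000, np.nan, 42000],
    "age":    [25, 27, 45, 50, 48, 26],
})

# Mean imputation for a numerical column
mean_imputer = SimpleImputer(strategy="mean")
df["income_mean"] = mean_imputer.fit_transform(df[["income"]]).ravel()

# Model-based alternative: KNN imputation, using "age" to find similar rows
knn_imputer = KNNImputer(n_neighbors=2)
imputed = knn_imputer.fit_transform(df[["income", "age"]])
df["income_knn"] = imputed[:, 0]

# Subgroup check: compare imputed values per group to spot unfair shifts
print(df.groupby("group")[["income_mean", "income_knn"]].mean())
```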

5. Outlier removal

Removing outliers aims to improve model performance. This is usually done by applying statistical methods such as z-scores, Interquartile Range (IQR), or robust scaling techniques. But removing outliers can introduce bias and needs to be validated to ensure fair treatment across data segments.
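
A possible IQR-based version of this check, using hypothetical income data and the conventional 1.5 * IQR rule, could look like this:

```python
import pandas as pd

# Hypothetical income data; the IQR bounds are standard defaults, not values
# taken from the course.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "income": [40000, 42000, 41000, 85000, 90000, 400000],
})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = df["income"].between(lower, upper)

# Validate: what share of each subgroup is dropped as "outliers"?
removed_share = (~mask).groupby(df["group"]).mean()
print(removed_share)

df_clean = df[mask]
```

If one subgroup loses a much larger share of its rows than the others, the outlier rule should be revisited before continuing.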

6. Data inconsistencies

Data quality is a big concern in modeling: data discrepancies, errors, or irregularities challenge the integrity and reliability of the model results. Traditional methods to address data inconsistencies include data standardization and the application of validation rules. To validate the results, we use the subgroup normalization technique. In this case, we don't apply a uniform normalization process across the entire dataset; instead, we normalize features within each subgroup separately. This prevents one subgroup from dominating others and minimizes bias.
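
One way to sketch subgroup normalization with pandas, assuming a hypothetical protected attribute and feature column, is to z-score each feature within its own group:

```python
import pandas as pd

# Hypothetical data; "group" is a protected attribute, "spend" a feature that
# would otherwise be normalized over the whole dataset.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "spend": [100, 120, 110, 1000, 1200, 1100],
})

# Subgroup normalization: standardize each feature within its own subgroup,
# so one subgroup's scale does not dominate the other's.
df["spend_norm"] = df.groupby("group")["spend"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```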

7. Feature scaling

As part of preprocessing, we use feature scaling to transform input features to a similar scale before feeding them into a model. This step prevents biased outcomes and ensures that no particular feature disproportionately influences the model's predictions. Feature scaling is validated by analyzing the distribution of scaled features across different groups to ensure the scaling process does not disadvantage any group.
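
A brief sketch of this validation step, using a hypothetical feature and scikit-learn's StandardScaler, might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature and protected attribute; names are illustrative.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "income": [40000, 42000, 41000, 85000, 90000, 88000],
})

scaler = StandardScaler()
df["income_scaled"] = scaler.fit_transform(df[["income"]]).ravel()

# Validation: compare the distribution of the scaled feature across groups;
# a large gap in means or spread can signal that scaling disadvantages a group.
print(df.groupby("group")["income_scaled"].describe())
```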

8. Feature encoding

To validate that feature encoding is unbiased and fair, it is crucial to assess how techniques like one-hot or label encoding affect model outcomes. This involves checking for introduced biases, information loss, and potential overfitting due to increased dimensionality. Regularization and dimensionality reduction methods can mitigate these risks, ensuring encoded features contribute positively to the model's fairness and accuracy.
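
As a rough illustration, assuming a hypothetical categorical feature, one-hot encoding together with a simple dimensionality and outcome-rate check could be sketched as follows:

```python
import pandas as pd

# Hypothetical categorical feature and outcome; column names are illustrative.
df = pd.DataFrame({
    "employment_type": ["salaried", "gig", "gig", "salaried", "self-employed"],
    "approved":        [1, 0, 1, 1, 0],
})

# One-hot encoding; watch how many columns it adds, since high-cardinality
# categories inflate dimensionality and raise the risk of overfitting.
encoded = pd.get_dummies(df, columns=["employment_type"])
print("Columns before:", df.shape[1], "after:", encoded.shape[1])

# Simple baseline: outcome rate per original category, so the encoded model's
# behavior can later be compared against these values per group.
print(df.groupby("employment_type")["approved"].mean())
```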

9. Dimensionality reduction

Finally, dimensionality reduction is applied to reduce the number of input features while preserving the essential information. By reducing computational complexity, this step may improve model results, but it can also introduce bias and affect the representation of minority groups. So, it is recommended to use fairness-conscious techniques, such as t-SNE.
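
A sketch of this idea, using synthetic data and scikit-learn's t-SNE implementation purely for illustration, could look like the following; the randomly assigned protected attribute and the spread-based check are assumptions, not part of the course material.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Synthetic data standing in for a real dataset; the protected attribute is
# assigned at random purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
group = np.random.default_rng(0).choice(["A", "B"], size=200, p=[0.8, 0.2])

# Reduce the features to two dimensions with t-SNE
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Validation idea: check that the minority group is still well represented and
# spread out in the reduced space rather than collapsed into a single cluster.
emb = pd.DataFrame(embedding, columns=["dim1", "dim2"])
emb["group"] = group
print(emb.groupby("group")[["dim1", "dim2"]].std())
```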

10. Financial advisor

In the "Financial Advisor" AI project, we preprocess "annual income" and "investment frequency" features to adjust outliers and scale them. These adjustments are done by considering the financial behaviors of minority groups, such as gig economy workers. A subgroup analysis is conducted on this demographic to validate the preprocessing steps. We compare equal opportunity metrics before and after preprocessing as part of subgroup analysis to ensure that removing outliers and scaling features does not inadvertently disadvantage this group.

11. Let's practice!

Now, it is time to practice!