1. Cross field validation
In this lesson, we'll talk about using cross field validation to find dirty data.
2. What is cross field validation?
Cross field validation is essentially a sanity check on your data to check that one value makes sense based on other values in your dataset.
Here, 52% and 58% add up to 110%, which doesn't make sense in this context, so we know that one or more of these values are wrong.
3. Credit card data
Here's a data frame called credit_cards, which contains information on credit card accounts, including the date it was opened, the amount of cash back the account has received on dining, groceries, and gas, as well as the total cash back and the age of the account.
4. Validating numbers
For this type of credit card, there are only three categories of purchases that you can earn cash back on, so we know that the three categories should add up to the total.
5. Validating numbers
To find any accounts where the categories don't sum up to the total, we can first create a new column called theoretical_total, which contains the sum of dining, grocery, and gas cash back. Then, we can filter for rows where the theoretical_total is not equal to the total_cb, or the actual total.
We have two rows that don't add up correctly.
6. Validating date and age
We can also validate dates and ages in a similar way, since we know that the difference between today's date and the date_opened should be the account's age.
7. Calculating age
To calculate age in years, we take the origin date, in this case Sept 9, 2015, and use the percent-minus-minus-percent operator followed by the today function. This will calculate how much time has passed between the origin date and today.
However, this returns a special data type, so we'll need to convert it to numeric using as-dot-numeric. We'll pass "years" in quotes as the second argument to convert the time interval into a number of years.
Finally, if we want only the number of years, we can use the floor function to get rid of the decimals.
8. Validating age
We can use the same mutate and filter strategy as we used when validating the cash back totals. We'll create a column called theor_age to hold the calculated age, and compare it to the account age.
We have two accounts that have mismatched ages.
9. What next?
Once we find this invalid data, what do we do with it?
Just like other data cleaning problems, there is no one-size-fits-all solution, since finding the best solution is dependent on the dataset.
We can decide to drop inconsistent data, set it to missing and impute it, or apply some rules using domain knowledge, but in order to know what the best approach is, we have to have a good understanding of where the dataset comes from and the different sources feeding into it.
10. Let's practice!
Time to do some cross field validation on your own!