1. Uniqueness constraints
Sometimes we might have data containing duplicated data points caused by an error earlier in the pipeline. Duplicated data can have lots of unwanted side effects, but uniqueness constraints help us find and fix these duplicated data points.
2. What's a duplicate?
A duplicate is a data point that has the same values as another data in all columns, or in most columns.
3. Why do duplicates occur?
Duplicates can arise from data entry errors or other human errors,
4. Why do duplicates occur?
from bugs or design errors in data pipelines,
5. Why do duplicates occur?
but they most commonly arise from errors in joining data together from multiple sources.
6. Full duplicates
A full duplicate occurs when there are multiple rows that have the same value in every column. Here, rows 2 and 4 are full duplicates of each other, and rows 5 and 6 are also full duplicates of each other.
7. Finding full duplicates
We can find full duplicates using R's built-in "duplicated" function. If we pass in our data frame, called credit_scores, we get a logical vector, which is TRUE for each row that is a duplicate. If we take the sum of this vector, we get the total number of duplicates in the dataset, which is 2.
8. Finding full duplicates
We can check out the rows that are duplicates by using the duplicated function inside of filter, like this.
9. Dropping full duplicates
Since full duplicates are exact copies of each other, we can deal with these by simply removing them from the dataset, leaving only one copy behind.
This can be done using dplyr's distinct function on our data frame. If we count the number of duplicates in the dataset now, we get 0.
10. Partial duplicates
Partial duplicates are rows of data that have much of the same information and refer to the same data point, but there may be some columns that differ between them. Here, rows 3 and 4 are partial duplicates of each other and rows 5 and 6 are partial duplicates of each other. They have the same name and address, but different credit scores. This can happen due to data entry errors or technical errors. In this example, some data from a year ago may have gotten mixed with the data from this year.
11. Finding partial duplicates
To include partial duplicates in our search, we can use dplyr's count function to count the number of occurrences of each pair of first name and last name, then filter for the pairs that occur more than once. This will give us data points that share the same name, but may differ in address or credit score.
12. Finding partial duplicates
We can save this as a data frame called dup_ids and then filter to take a look at all the rows that have a duplicate. This gives us a list of the full duplicates and the partial duplicates in the dataset.
13. Handling partial duplicates: dropping
One way to handle partial duplicates is to
14. Handling partial duplicates: dropping
drop all duplicates
15. Handling partial duplicates: dropping
and keep only one of them.
16. Dropping partial duplicates
In R, we can also use the distinct function to drop partial duplicates, but this time, we pass in the column names that duplicates should be matched on. Here, it's first name and last name. We also set the dot-keep_all argument to TRUE, which will keep all columns of the data frame instead of only first_name and last_name.
17. Handling partial duplicates: summarizing
Another way to handle partial duplicates is to summarize them using a summary statistic, like the mean or max. Since the credit scores differ in our example,
18. Handling partial duplicates: summarizing
we can take the mean credit score for each individual
19. Handling partial duplicates: summarizing
keeping only one copy of each duplicate,
20. Handling partial duplicates: summarizing
and use the mean as the value for credit score.
21. Handling partial duplicates: summarizing
22. Summarizing partial duplicates
To summarize partial duplicates, we group by first_name and last_name, and then mutate, adding a column containing the mean credit score for each name. Since we've grouped before mutating, the mean_score column contains the mean score for each person.
23. Summarizing partial duplicates
We can pipe all of this into distinct, using first and last name and setting dot-keep_all to TRUE to remove the extra duplicates. We can also remove the old credit_score column to clean things up.
24. Let's practice!
Time to dive in to some duplicated data!