Uniqueness constraints
1. Uniqueness constraints
Hi and welcome to the final lesson of this chapter. Let's discuss another common data cleaning problem, duplicate values.
2. What are duplicate values?
Duplicate values can be diagnosed when we have the same exact information repeated across multiple rows, for some or all columns in our DataFrame. In this example DataFrame containing the names, address, height, and weight of individuals, the rows presented have identical values across all columns.
3. What are duplicate values?
In this one, there are duplicate values for all columns except the height column -- which leads us to think it's more likely a data entry error than a record of another person.
4. Why do they happen?
Apart from data entry and human errors alluded to in the previous slide,
5. Why do they happen?
duplicate data can also arise because of bugs and design errors, whether in business processes or data pipelines.
6. Why do they happen?
However, they most often arise from the necessary act of joining and consolidating data from various sources, which can retain duplicate values.
7. How to find duplicate values?
Let's first see how to find duplicate values. In this example, we're working with a bigger version of the height and weight data seen earlier in the video.
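The course's dataset isn't reproduced in this transcript, so as a stand-in, here is a minimal sketch of a height_weight DataFrame -- the names, addresses, and measurements are invented for illustration, shaped so that it contains both complete and incomplete duplicates:

import pandas as pd

# Stand-in data: rows 0 and 1 are complete duplicates, rows 2 and 3
# match on everything except height, and rows 4 and 5 on everything
# except weight.
height_weight = pd.DataFrame({
    'first_name': ['Lane', 'Lane', 'Ivor', 'Ivor', 'Cole', 'Cole', 'Mary'],
    'last_name': ['Reese', 'Reese', 'Pierce', 'Pierce', 'Palmer', 'Palmer', 'Colon'],
    'address': ['534-1559 Nam St.', '534-1559 Nam St.',
                '102-3364 Non Rd.', '102-3364 Non Rd.',
                '8366 At St.', '8366 At St.',
                '4674 Ut Rd.'],
    'height': [181, 181, 168, 175, 157, 157, 179],
    'weight': [64, 64, 66, 66, 56, 61, 75],
})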
8. How to find duplicate values?
We can find duplicates in a DataFrame by using the dot-duplicated() method. It returns a Series of boolean values that are True for duplicate values, and False for non-duplicated values.
9. How to find duplicate values?
We can see exactly which rows are affected by using brackets, as such. However, using dot-duplicated() without playing around with the arguments of the method can lead to misleading results, as all the columns are required to have duplicate values by default, with all duplicate values being marked as True except for the first occurrence. This limits our ability to properly diagnose what type of duplication we have, and how to effectively treat it.
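As a minimal sketch of both steps, using the stand-in height_weight DataFrame from above:

# True marks a row that repeats an earlier row across all columns
# (the default behavior of .duplicated()).
duplicates = height_weight.duplicated()
print(duplicates)

# Pass the boolean Series inside brackets to see the affected rows.
print(height_weight[duplicates])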
10. How to find duplicate rows?
To properly calibrate how we go about finding duplicates, we will use two arguments of the dot-duplicated() method. The subset argument lets us set a list of column names to check for duplication -- for example, it allows us to find duplicates for the first and last name columns only. The keep argument lets us keep the first occurrence of a duplicate value by setting it to the string first, the last occurrence of a duplicate value by setting it to the string last, or keep all occurrences of duplicate values by setting it to False. In this example, we're checking for duplicates across the first name, last name, and address variables, and we're choosing to keep all duplicates.
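A sketch of that call, continuing with the stand-in data:

# Columns that must match for two rows to count as duplicates.
column_names = ['first_name', 'last_name', 'address']

# keep=False marks every occurrence of a duplicate, not just the repeats.
duplicates = height_weight.duplicated(subset=column_names, keep=False)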
11. How to find duplicate rows?
We see the following results -- to get a better bird's-eye view of the duplicates,
12. How to find duplicate rows?
we sort the duplicate rows using the dot-sort_values() method, choosing first_name to sort by.
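Continuing the sketch from the previous snippet:

# Output the flagged rows, sorted so that duplicate rows sit next
# to each other.
print(height_weight[duplicates].sort_values(by='first_name'))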
13. How to find duplicate rows?
We find that there are four sets of duplicated rows, the first two sets being complete duplicates of each other across all columns, highlighted here in red.
14. How to find duplicate rows?
The other two sets are incomplete duplicates of each other, highlighted here in blue, with discrepancies in height and weight respectively.
15. How to treat duplicate values?
The complete duplicates can be treated easily. All that is required is to keep only one of them and discard the others.
16. How to treat duplicate values?
This can be done with the dot-drop_duplicates() method, which takes in the same subset and keep arguments as the dot-duplicated() method, as well as the inplace argument, which drops the duplicated values directly inside the height_weight DataFrame. Here we are dropping complete duplicates only, so it's neither necessary nor advisable to set a subset, and since the keep argument takes in first by default, we can keep it as such. Note that we can also set it to last, but not to False, as that would keep all duplicates.
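A sketch of that step, again on the stand-in data:

# Drop complete duplicates in place; keep='first' is the default, so
# only the first occurrence of each fully duplicated row is retained.
height_weight.drop_duplicates(inplace=True)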
17. How to treat duplicate values?
This leaves us with the other two sets of duplicates discussed earlier, which are the same for first_name, last_name and address, but contain discrepancies in height and weight. Apart from dropping rows with really small discrepancies, we can use a statistical measure to combine each set of duplicated values.
18. How to treat duplicate values?
For example, we can combine these two rows into one by computing their mean, or their maximum, or another statistical measure. The choice is highly dependent on a common-sense understanding of our data, and what type of data we have.
19. How to treat duplicate values?
We can do this easily using the groupby method, which, when chained with the agg method, lets you group by a set of common columns and return statistical values for specific columns when the aggregation is performed. For example, here we create a dictionary called summaries, which instructs the agg method to return the maximum of the duplicated rows for the height column, and the mean of the duplicated rows for the weight column. We then group height_weight by the column names defined earlier, and chain it with the agg method, which takes in the summaries dictionary we created. We chain this entire line with the dot-reset_index() method, so that we can have numbered indices in the final output. We can verify that there are no more duplicate values by running the duplicated method again, and use brackets to output duplicate rows.
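A sketch of the whole chain, continuing with the stand-in data and the column_names list defined earlier:

# Collapse each set of incomplete duplicates into one row:
# maximum height, mean weight.
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by=column_names).agg(summaries).reset_index()

# Verify: this should now select no rows.
duplicates = height_weight.duplicated(subset=column_names, keep=False)
print(height_weight[duplicates])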
20. Let's practice!
Now that we have a solid grasp of duplication, let's practice.