Handling duplicated data
1. Handling duplicated data
In our last lesson, we covered missing data. Now we will tackle the challenge of duplicated data.
2. Duplicate data
Duplication introduces several problems into your data. While data storage tends to be cheap, it is not free, so avoiding multiple copies of the same data conserves storage resources. More importantly, duplication can over-represent certain values in an analysis. Imagine calculating an average over records that include duplicates: the result would not reflect the actual expected value for those records.
3. Detecting duplicated data
According to the NYC.gov site hosting the restaurant inspection dataset, "establishments are uniquely identified by their CAMIS (record ID) number." However, the CAMIS number is duplicated in this dataset.
4. Detecting duplicated data
The query result reveals that this dataset contains 579 duplicated CAMIS values. Understanding how the CAMIS value is used is important for deciding how to handle this duplication. If rows with duplicate CAMIS values also contain identical values across the other columns, the data may be unnecessarily duplicated.
5. Detecting duplicated data
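The duplicate check described here can be sketched as a runnable example. This sketch uses Python's built-in sqlite3 module in place of PostgreSQL (the lesson's database), since the GROUP BY/HAVING pattern is identical in both; the table and column names follow the lesson's dataset, but the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (camis INTEGER, name TEXT, boro TEXT)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?)",
    [
        (1001, "Cafe A", "Queens"),
        (1001, "Cafe A", "Queens"),   # duplicated CAMIS value
        (1002, "Diner B", "Brooklyn"),
    ],
)

# CAMIS values that appear more than once
rows = conn.execute(
    """
    SELECT camis, COUNT(*) AS appearances
    FROM inspections
    GROUP BY camis
    HAVING COUNT(*) > 1
    """
).fetchall()
print(rows)  # [(1001, 2)]
```

The same SELECT/GROUP BY/HAVING statement runs unchanged against the real dataset in PostgreSQL.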
Adding the name and boro columns to the SELECT list and grouping by all 3 columns identifies the same 579 duplicated groups. Considering just these 3 columns, there appears to be unnecessary duplication. However, when the inspection_date column is added, the number of records sharing the same values for all 4 columns drops to 83. The inspection_date column contains valuable information, and we want to preserve it across records with the same camis value.
6. Detecting duplicated data
When the violation_code column is included, the number of records with duplicated values drops to 0. The apparent duplicates are therefore not due to an error, and removing these records is not the correct approach for reducing duplication. This example highlights the need for intimate domain knowledge when addressing duplicated values.
7. Detecting duplicated data
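The progression described here, where widening the grouping column set shrinks the number of duplicated groups, can be sketched as follows. Again sqlite3 stands in for PostgreSQL, and the three sample rows are invented: one establishment with two violations from the same inspection plus a later inspection, so the counts below are for this toy data, not the lesson's 579/83/0 figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections "
    "(camis INTEGER, name TEXT, boro TEXT, inspection_date TEXT, violation_code TEXT)"
)
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?, ?, ?)",
    [
        # same establishment, same inspection, two different violations:
        (1001, "Cafe A", "Queens", "2022-01-10", "04L"),
        (1001, "Cafe A", "Queens", "2022-01-10", "06C"),
        # same establishment, inspected on a different date:
        (1001, "Cafe A", "Queens", "2022-06-02", "04L"),
    ],
)

def duplicate_groups(columns):
    """Count the column-value groups that occur more than once."""
    cols = ", ".join(columns)
    query = (
        f"SELECT COUNT(*) FROM (SELECT {cols} FROM inspections "
        f"GROUP BY {cols} HAVING COUNT(*) > 1)"
    )
    return conn.execute(query).fetchone()[0]

print(duplicate_groups(["camis"]))                                    # 1
print(duplicate_groups(["camis", "name", "boro", "inspection_date"])) # 1
print(duplicate_groups(["camis", "name", "boro",
                        "inspection_date", "violation_code"]))        # 0
```

Once violation_code joins the grouping, no group repeats: each row carries distinct information and none should be deleted.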
Now, consider a case where some duplicates do exist for these columns. Here, we have 3 records with the same camis, name, boro, inspection_date, and violation_code. Including the duplicate records would result in this record being over-represented in an analysis of, say, the number of violations occurring in Queens.
8. The ROW_NUMBER() function
PostgreSQL provides a function named ROW_NUMBER() that is useful for identifying where groups of column values are duplicated in a dataset. The ROW_NUMBER() function call must be followed by an OVER clause, which defines the window (or set of rows) over which the function is applied. Listing columns after PARTITION BY within the OVER clause enumerates repeated groups of those column values, starting at 1: with all 5 columns shown here under PARTITION BY, ROW_NUMBER() assigns 1 to the first row in each group of repeated values, 2 to the second, and so on. An optional ORDER BY clause within the OVER clause controls the order in which the numbers are assigned.
9. Enumerating duplicate rows
This query uses ROW_NUMBER() with the columns of interest specified in the PARTITION BY clause to create a column named "duplicate". The value in this column indicates which copy of its group the record represents. Subtracting 1 from the value returned by ROW_NUMBER() assigns 0 to non-duplicated records.
10. Enumerating duplicate rows
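A minimal sketch of this enumeration, with Python's sqlite3 standing in for PostgreSQL (SQLite supports the same ROW_NUMBER() OVER (PARTITION BY ...) syntax from version 3.25, which ships with modern Python builds); the "duplicate" alias follows the lesson, while the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections "
    "(camis INTEGER, name TEXT, boro TEXT, inspection_date TEXT, violation_code TEXT)"
)
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "Cafe A", "Queens", "2022-01-10", "04L"),
        (1001, "Cafe A", "Queens", "2022-01-10", "04L"),  # exact duplicate
        (1002, "Diner B", "Brooklyn", "2022-03-15", "06C"),
    ],
)

# Enumerate each group of repeated values; subtracting 1 makes
# non-duplicated rows carry a 0 in the "duplicate" column.
rows = conn.execute(
    """
    SELECT camis, violation_code,
           ROW_NUMBER() OVER (
               PARTITION BY camis, name, boro, inspection_date, violation_code
           ) - 1 AS duplicate
    FROM inspections
    """
).fetchall()
for row in sorted(rows):
    print(row)
# (1001, '04L', 0)
# (1001, '04L', 1)   <- flagged as a duplicate
# (1002, '06C', 0)
```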
Rows with a duplicate value above 0 can then be removed from the dataset.
11. Resolving impartial duplicates
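One possible removal query, sketched with sqlite3: this version uses SQLite's implicit rowid to target the individual rows to delete, which is an assumption of the sketch rather than the lesson's query. In PostgreSQL, the table's primary key (or the system column ctid) would play the same role in a similar subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections "
    "(camis INTEGER, name TEXT, boro TEXT, inspection_date TEXT, violation_code TEXT)"
)
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "Cafe A", "Queens", "2022-01-10", "04L"),
        (1001, "Cafe A", "Queens", "2022-01-10", "04L"),  # exact duplicate
        (1002, "Diner B", "Brooklyn", "2022-03-15", "06C"),
    ],
)

# Delete every row whose enumerated duplicate number is above 0,
# keeping exactly one copy of each duplicated group.
conn.execute(
    """
    DELETE FROM inspections
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY camis, name, boro,
                                    inspection_date, violation_code
                   ) - 1 AS duplicate
            FROM inspections
        )
        WHERE duplicate > 0
    )
    """
)

remaining = conn.execute("SELECT COUNT(*) FROM inspections").fetchone()[0]
print(remaining)  # 2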
One additional type of duplicate which you might encounter is what is called an "impartial" duplicate. Impartial duplicates arise when record values are duplicated for some columns while introducing ambiguity for column values in which they differ. Simply deleting duplicates, in this case, is not appropriate due to this ambiguity. Take the following data as an example. The first two records share all values except for the score column. Ideally, documentation containing the actual score earned for this inspection can resolve this ambiguity. If such a document is not available, a different approach is required.12. Resolving impartial duplicates
A common way to resolve such ambiguous values is to replace the ambiguous values with a value computed from an aggregate function such as the average value. This query serves two purposes. First, the query identifies impartial duplicates. Second, the query provides the replacement value for resolving the ambiguity. The HAVING clause ensures that the query only returns records with duplicate values for the columns in the GROUP BY clause.13. Resolving impartial duplicates
This change eliminates duplicate records and resolves the ambiguity with an updated score.14. Let's practice!
Now, let's apply our knowledge of working with duplicated data.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.