Get startedGet started for free

Report duplication

1. Report duplication

Data can get duplicated rather easily. This is the most dangerous type of messy data, as it can make your report show incorrect numbers and cause your business to make bad decisions.

2. What causes duplication?

Data duplication occurs most commonly when joining datasets of different granularity or scope. Here’s an example. The points table here has only one row for each id. The matches table, on the other hand, has three rows for each id.

3. What causes duplication?

This means that one row from the points table will join to three rows on the matches table. As a result, running a join that sums up the points field

4. What causes duplication?

5. What causes duplication?

This is because of the order of operations SQL takes when running a join. SQL first runs the join row-by-row and effectively creates an intermediate table, as shown. SQL then aggregates the data on this intermediate table.

6. What causes duplication?

Since the points field is shown three times, we get duplication when we sum it.

7. Ways to fix duplication

There are a few ways to fix duplication. The simplest way is to simply remove the aggregation causing the duplication. In our example, removing the SUM function from points gives us the correct results. This works in our case because the points table only has 1 value for each id, so no aggregation is technically needed.

8. Ways to fix duplication

But this approach only works if you have 1 and only 1 value for each join key. Let's say we had another field in the points table for year. In this case, we can't pull an unaggregated version of points, as there are now 2 rows for each id in the points table. We must aggregate it.

9. Ways to fix duplication

A join in this example will still give us duplication, as shown. Another way to solve duplication is to add another field to the JOIN statement. In this example, we can add the year field.

10. Ways to fix duplication

Now that we require two fields to join, each row in the points table maps to one and only one row in the matches table, which erases any duplication we previously had.

11. Ways to fix duplication

Another option to use is to roll-up the tables into the same granularity before any joining takes place. For example, we could roll-up our matches table to the id level as a first step, as so.

12. Ways to fix duplication

Now that the query only has 1 row for each id, we can use this query as a subquery and join it to the points table with 1-to-1 mapping. Thus, a third approach to fixing duplication is to roll-up tables using a subquery. You can see the full query here.

13. Ways to fix duplication

To recap, three ways to fix duplication are to remove aggregations, add fields to the join statement, or to rollup tables using a subquery.

14. Identifying duplication

To use one of these approaches, though, you need to realize duplication exists in the first place. I highly suggest getting used to validating your query using the approach mentioned in chapter 1. In this approach, you check the total value in the original table and compare it to the total value in your query. If the result of your query is higher than the original table, then your report has duplication.

15. Chapter goal

So that's a summary of what duplication is, how to identify it, and how to fix it. Next up is a series of exercises, the last of which will require you to clean data in several ways to build this element for your dashboard.

16. Practice time!

Before we do that, though, let's get some practice dealing with duplication.

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.