1. Managing duplicates
A common problem that SQL practitioners come across is duplicate rows in the results of a query. Duplicate rows can be the result of a poor database design, a poorly designed query, or both. In any case we may need a mechanism to remove duplicates. In this lesson, we'll look at the roles of DISTINCT() and UNION in removing duplicates and their potential effects on performance.
2. Query returning duplicates
If we run a simple query on the PlayerStats table that returns the name of each player that participated in the 2017-2018 NBA season, there are 605 rows returned. But there is a problem. Team is also a column in the table and players could’ve played for multiple teams that season, so our simple query produces duplicate player name rows.
3. Removing duplicates with DISTINCT()
We can remove duplicate rows by enclosing the PlayerName column in the DISTINCT() clause after SELECT.
4. GROUP BY instead of DISTINCT()
But could we have just used GROUP BY instead? Yes, And there would not be much difference in the time it takes for either query to run. If we want to remove duplicates and return unique rows, then use DISTINCT().
5. GROUP BY instead of DISTINCT()
However, if we're going to apply an aggregate function, for example, we want the count of teams each player played for, we’ll need to use GROUP BY.
6. Is there another way?
Often there is an easier way to get the results we want. In this case we wanted to know the players that participated in the 2017-2018 NBA season. By looking at the database, we see that we could’ve simply queried the Players table, where PlayerName is the unique key, so there are no duplicate rows.
7. Fruits tables
We can also produce duplicates when we append queries together to form one data set. Appending is different from a join because when we append we are combining the query rows, rather than joining on key columns. Appending rows together may create the issue of duplicates as we can see in this example.
8. Duplicate fruits
The rows for Strawberry and Orange duplicate when we append.
9. Remove duplicates with UNION
In SQL we use the UNION operator for appending. UNION will do the work of checking and removing duplicate rows for us.
10. What about UNION ALL?
In some cases, we may require the returned results to contain duplicate rows, or we may know that there will be no duplicate rows when we append queries. In this case, we can use UNION ALL which returns all rows regardless if there are duplicates or not. Interrogating the data before creating your query will give you an idea of whether UNION or UNION ALL is the appropriate operator to use for appending queries.
11. DISTINCT() and UNION
Using DISTINCT() and UNION in queries, to remove duplicate rows, should be used with caution. Depending on the design of the database, both may use an internal sort mechanism to order and check for duplicates. This sorting has the potential to increase the time it takes a query to run significantly.
12. Using DISTINCT()
When using DISTINCT(), check if there is an alternative way to achieve the results we want, for example, there may be another table, with the column we require, as the unique key. Or, use GROUP BY if the query is using an aggregate function.
13. Using UNION
When appending data, we need to ask ourselves if duplicates are OK or whether the appending queries will not produce duplicates. In these cases, consider using UNION ALL which does not make use of the internal sort.
We’ll look at this internal sorting and how it affects performance later in this course.
14. Let's practice!
Let's practice managing duplicates.