Get startedGet started for free

Duplicate matches

1. Duplicate matches

The last set of join problems you will commonly encounter arise from duplicate matches in the join key values.

2. Join key duplicates

Sometimes you will want to join two data tables based on columns with duplicate entries. In the example shown on the slide, we are joining two data tables containing data we've measured on the microbial content of soil at two different farm sites. We've used multiple methods to try to quantify whether a bacterial genus is present, and if so, by how much? We then might want to join these two data tables to get a list of bacteria that could be found at both sites using any method, so we join the two data tables using the genus column as the join key. This leads to multiplicative matches. Each row in the site2_ecology data table will match two rows in the site1 ecology_data table, leading to 12 matches all up, six for each bacteria shown.

3. Error from multiplicative matches

If you try to run this code, you will get a long error as shown here warning you that the result will have more rows than both data tables combined. This error is shown to protect you from making unexpectedly large data tables by accident because multiplicative matches are a common problem in misspecified joins, which you learnt about in the first lesson of this chapter.

4. Allowing multiplicative matches

When you do a join like this intentionally, you can set allow dot cartesian equals TRUE to let the join proceed. This argument is found in both the data table syntax and the merge() function.

5. Allowing multiplicative matches

With data table syntax, you will get the following data table as a result

6. Missing values

If you have missing values in your join key column, then these can also lead to multiplicative matches, because they will match all other missing values in the join.

7. Filtering missing values

To remove these matches, which are likely meaningless, you can filter out rows with missing values in the join key columns using the is dot na function along with not operator.

8. Keeping only the first match

In some cases, you may want to join two data tables where each value in the join key column has multiple matches, but you only want to keep the first match. For example, we might have multiple collections at site 1, but when matching to the site2_ecology data table we only want to match the most recent collection. To do this you can set mult equals first to keep only the first matching row in the main data table on the left side of the join.

9. Keeping only the last match

Conversely, you can set mult equals last to keep only the last matching row.

10. Identifying and removing duplicates

You can also use the duplicated() and unique() functions to identify and remove duplicates.

11. The duplicated() function

The duplicated() function from the the data table package will tell you which rows are duplicates. It returns a vector of TRUE and FALSE values, where each entry corresponds to a row in the data table. A value of TRUE means that row has the same values as one of the previous rows in the data table. An additional argument, "by", can be used to restrict those checks to only a subset of columns, for example looking for duplicates in just the join key column.

12. The unique() function

The unique() function takes the same arguments, but instead returns the data table without the rows that were determined to be duplicates.

13. Changing the search order

You can also change the direction the duplicated() and unique() functions search for and remove duplicates by specifying fromLast equals TRUE.

14. Let's practice!

Now it's your turn to practice handling duplicate matches.