1. Complex keys
Welcome back. In this lesson, you'll learn how to diagnose misspecified joins, and how to use more complex join keys.
2. Misspecified joins
A misspecified join is one where you don't use the correct join keys. This can lead to a variety of errors and malformed data tables.
All of these are good indicators you need to go back and double-check your join keys.
3. Column type mismatch
If you see an error message with the word "bmerge" and "typeof" it means that the join key columns you've specified have different types, so they can't be matched.
You might be using the wrong columns,
4. Column type mismatch
Or the columns may have been loaded in as different types and need converting so that both have the same type.
5. Malformed full joins - no common key values
If the columns have the same type, then the output and difficulty of diagnosing the problem will depend on the type of join, and whether the columns coincidentally have any values in common.
If the columns have no values in common then a full join will lead to the two data tables being stacked, and filled with missing values.
6. Malformed right and left joins - no common key values
A right or left join will just return the rows from one of the data tables, with columns from the other filled entirely with missing values.
7. Malformed inner joins - no common key values
And an inner join will have zero rows.
8. Malformed joins - coincidental common key values
If you use the wrong key, but the columns coincidentally share some common values, the problem can be much harder to pick up. So its important to check that the result of your join makes sense.
9. Avoiding misspecified joins
The best way to avoid misspecified joins is to take the time to learn about the information each column contains so that you can critically reason about how to match rows before attempting a join.
There are several factors that can make identifying join keys more difficult than the examples you've encountered in previous lessons.
10. Keys with different column names
Sometimes the join keys might have different column names in each data table.
In these cases, you use the by dot x and by dot y arguments in the merge() function,
and in the data table join syntax you use an equals operator in the on argument, or in the character vector supplied to the on argument.
11. Multi-column keys
You may also need multiple columns to match rows between data tables.
This can be because the information that uniquely identifies rows is split across multiple columns,
12. Multi-column keys
or because there are multiple entries for each entity.
13. Specifying multiple keys with merge()
In the merge() function, multiple key columns are supplied as a character vector to the by argument.
If one or more of these keys has different column names in both data tables, you will need to supply a character vector to both the by dot x and by dot y arguments. Join key column names are matched in the order they appear in each character vector.
14. Specifying multiple keys with the data.table syntax
When using the data table syntax, multiple keys can be supplied separated by commas.
Columns that have different names are specified using an equals sign.
15. Final Slide
Now, its your turn to practice with complex keys.