1. Joining on key columns
In this lesson we'll look further at how to complete a one-to-many join based on key columns.
2. Framework (continued)
Let's continue with our framework. Once we’ve determined the relationship between data frames,
the next step is to check each table for unique values in the key column. We do this to make sure we're merging on the right level.
3. Unique key columns
In tables with single column keys,
you can check the data frame for unique values within the key column itself. This statement uses the 'duplicated' method to identify duplicate rows for the key column. We add the 'sum' method to count the number of duplicates.
The returned value indicates the number of unique rows that are duplicated. If all goes well, you'll see a zero.
In tables with multi-column keys, we are looking for unique combinations of values across the key columns.
A nice feature with Pandas is there's no need to create a concatenated column.
We can again use the 'duplicated' method, this time with the key columns specified in a bracketed list.
4. Framework (continued)
Now for the fun part - writing the merge statement and executing the code.
5. Executing the merge
Guess what?
The merge statement for a one-to-many join is no different than what you saw earlier in a one-to-one join! In practice, it’s important to pay extra attention to the parameters contained in the merge statement. Left, right, inner and outer joins can result in very different data frames after a one-to-many merge.
Are you ready for what’s next? This is the full syntax for a merge statement. The optional parameters you see here provide a wide range of flexibility for dealing with different situations. We've used some of these options already and will use others as you progress through the course.
6. Validating merges
Speaking of parameters, let's look at the last parameter in the expression, 'validate'. It checks if the merge is of the specified type.
You can specify one of four types.
For one-to-one, it checks if merge keys are unique in both left and right datasets.
For one-to-many, it checks if merge keys are unique in the left dataset.
For many-to-one, it checks if merge keys are unique in the right dataset.
If the merge fails these checks, it will result in an error telling you the keys are not unique as specified.
You can also specify many-to-many, but this does not result in a check. It's simply available in case you want to write consistent code.
7. Let's practice!
OK then, enough theory. Let's practice merging data frames!