1. Joining data in real life
In this 4th and final chapter we’ll cover advanced joins. We’ll start with situations you will likely meet when dealing with real data.
2. Mixing columns and indexes
Data is often organized in different ways and comes in different formats. In this example we have two tables showing data by unique player ids. However, the left-hand table has player id as a data column while the right-hand table uses player_id as its index.
How do you join these data frames? Well, you could reindex the data frames before merging. But there's a more direct way.
You can use teams dot merge and specify ‘left_on equals player-id’ along with the argument ‘right-index = true’.
You could also merge the dataframes if left and right were reversed.
In that case you would start with ‘positions dot merge, teams’, then follow with ‘left_index equals true’ and ‘right_on equals player id’.
The best choice depends on your overall plan for joining the data and the desired result.
3. Working with overlapping column names
What about columns with the same name that actually refer to different features? In this example we see both data frames show teams for players. However, the context differs. One table refers to the player's current team and the other refers to the team that first drafted the player into professional football.
Pandas will automatically resolve the conflict by naming the columns team-underscore-x and team-underscore-y. But that's not very descriptive and may cause confusion.
You can use the 'suffixes' parameter to make it clearer. All you need to do is specify 'suffixes-equals', followed by the suffixes for both frames enclosed in parentheses.
By the way, two values inside parentheses, separated by a comma, are called a tuple in Python. So, congratulations - you just used a tuple!
4. Identifying rows by table
You might find yourself at times wanting to show which input data frame contributed rows of data to the merged result. This information can be especially useful for outer merges that include all rows from each table.
In this case you can use ‘indicator = true’ to show where each row of data started.
The resulting data frame will have a new column, ‘underscore merge’, that shows ‘left only’, ‘right only’, or ‘both’.
In the example shown here, most players appear in only the left or right table. David Johnson is the exception and appears in both tables.
5. Sorting rows by key
Finally, you might find it helpful to sort data after joining. If you want to sort by the key column used to join the data frames,
just add ‘sort=True’ to the merge statement. The resulting data frame will appear in increasing order by the key.
Pandas also has a ‘sort-values’ method you can apply later, but including the ‘sort’ parameter in the merge statement is an efficient way to take care of it during the join.
6. Let's practice!
OK, now that you’ve expanded your knowledge of merging data in real-life, it’s time to practice.