Filtering joins

1. Filtering joins

Welcome to the third chapter! In this lesson, we will discuss a type of join called a filtering join. pandas doesn't provide direct support for filtering joins, but we will learn how to replicate them.

2. Mutating versus filtering joins

So far, we have only worked with mutating joins, which combines data from two tables. However, filtering joins filter observations from one table based on whether or not they match an observation in another table.

3. What is a semi join?

Let's start with a semi join. A semi join filters the left table down to those observations that have a match in the right table. It is similar to an inner join where only the intersection between the tables is returned, but unlike an inner join, only the columns from the left table are shown. Finally, no duplicate rows from the left table are returned, even if there is a one-to-many relationship. Let's look at an example.

4. Musical dataset

For this chapter, the dataset we will use is from an online music streaming service.

5. Example datasets

In this new dataset, we have a table of song genres shown here. There's also a table of top-rated song tracks. The 'gid' column connects the two tables. Let's say we want to find what genres appear in our table of top songs. A semi join would return only the columns from the genre table and not the tracks.

6. Step 1 - semi join

First, let's merge the two tables with an inner join. We also print the first few rows of the genres_tracks variable. Since this is an inner join, the returned 'gid' column holds only values where both tables matched.

7. Step 2 - semi join

For the next step in the technique, let's focus on this line of code. It uses a method called isin(), which compares every 'gid' in the genres table to the 'gid' in the genres_tracks table. This will tell us if our genre appears in our merged genres_tracks table.

8. Step 2 - semi join

This line of code returns a Boolean Series of true or false values.

9. Step 3 - semi join

To combine everything, we use that line of code to subset the genres table. The results are saved to top_genres and we print a few rows. We've completed a semi join. These are rows in the genre table that are also found in the top_tracks table. This is called a filtering join because we've filtered the genres table by what's in the top_tracks table.

10. What is an anti join?

Now let's talk about anti joins. An anti join returns the observations in the left table that do not have a matching observation in the right table. It also only returns the columns from the left table. Now, let's go back to our example. Instead of finding which genres are in the table of top tracks, let's now find which genres are not with an anti join.

11. Step 1 - anti join

The first step is to use a left join returning all of the rows from the left table. Here we'll use the indicator argument and set it to True. With indicator set to True, the merge method adds a column called "_merge" to the output. This column tells the source of each row. For example, the first four rows found a match in both tables, whereas the last can only be found in the left table.

12. Step 2 - anti join

Next, we use the "loc" accessor and "_merge" column to select the rows that only appeared in the left table and return only the "gid" column from the genres_tracks table. We now have a list of gids not in the tracks table.

13. Step 3 - anti join

In our final step we use the isin() method to filter for the rows with gids in our gid_list. Our output shows those genres not in the tracks table.

14. Let's practice!

Now, your turn.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.