Get startedGet started for free

Joining DataFrames by condition or nearest match

1. Joining DataFrames by condition or nearest match

We've seen how to join DataFrames on matching values. Now let's explore more flexible joins - matching by conditions and nearest values.

2. Joining on a condition

A standard join matches on equal values. But sometimes we need to match on a condition like "greater than or equal to." Consider this example with the World Cup winners.

3. Joining on a condition

And a table with the minimum wins needed to reach a level.

4. Joining on a condition

A conditional join keeps every pair where wins is greater than or equal to min_wins. We get 4 rows.

5. Joining on a condition

Because Brazil meets both gold and silver thresholds.

6. App users with preferences

Now we use a conditional join to match users with restaurants. Our restaurants DataFrame includes each venue's type and price, from the cheap cafes to the expensive bars.

7. The users table

Our users DataFrame has a budget in pounds and a preferred venue type for each user.

8. Joining on a condition

To match users and restaurants, we call join_where on the restaurants DataFrame.

9. Joining on a condition

And pass the users DataFrame.

10. Joining on a condition

We pass the join condition as an expression. We want the restaurant type to match the user's preferred type. When names clash, Polars renames the second table's column with _right, so type_right is type from the users table.

11. Joining on a condition

This returns 4 matching pairs, where each restaurant matches the user's preferred type. Bob matches both Costa Coffee locations because both are cafés.

12. Joining on multiple conditions

We can pass multiple conditions separated by commas. Here, we also require the price to be within the user's budget. Now we get only 3 matches. Bob can afford both Costa Coffee locations, while Alice can afford 7burgers. Charlie has no match because The Queens Head exceeds his budget.

13. Joining on most recent values

Now let's turn to nearest-value joins. We want to understand if changes in inspection scores affect review scores. On the left, we have inspection records with dates for 7burgers. On the right, we have user reviews for this restaurant. However, inspections and reviews occur on different dates, so we need to match each review to the most relevant inspection.

14. Joining on most recent values

We first sort reviews by date and then call join_asof. Think of join_asof as meaning "join as of" a certain date.

15. Joining on most recent values

We pass inspections, also sorted by date.

16. Joining on most recent values

We set on to date - that's our matching column.

17. Joining on most recent values

Using the default backward strategy, we match the most recent inspection before each review. Bob reviewed 7burgers in August and gets the June inspection.

18. Joining on nearest values

Alternatively, we can match to the closest inspection. With the strategy set to nearest, Bob now gets the September inspection with a rating of 4, instead of the June inspection with a rating of 3.

19. Choosing a strategy

Overall, if we have one DataFrame with data at one and seven minutes past midnight, and another with data every 10 minutes

20. Choosing a strategy - backwards

Then backward matches them both with the first row.

21. Choosing a strategy - nearest

Nearest matches to the closest row.

22. Choosing a strategy - forwards

And forward matches both to the second row.

23. Matching within groups

We now want to match reviews to inspections across all businesses. We do this by adding by=["business", "location"], so Polars matches within each venue before matching dates. Now we have all businesses, with nulls where no inspection occurred before the review date.

24. Let's practice!

Now you know join_where for conditions and join_asof for nearest matching. Time to practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.