Calculating ranks and correlations

1. Calculating ranks and correlations

In this lesson, we'll rank venues by price, then use correlation to find users with similar tastes.

2. Ranking

Ranking is useful for understanding position within a dataset. Let's say we want to rank these World Cup winners in descending order. Brazil is clearly number 1, but what happens when we have ties like Germany and Italy both at 4 wins?

3. Ranking

One method for tie-breaking is to say they are both number 2.

4. Ranking

Another method is to assign them both 2.5 - the average of 2nd and 3rd.
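To make the two tie-breaking rules concrete, here is a small plain-Python sketch. The helper rank_desc is hypothetical and written only for illustration, and Brazil's count of 5 wins is an assumption added to complete the example; the Germany and Italy ties at 4 wins come from the slide.

```python
def rank_desc(scores, method="average"):
    """Rank values in descending order, resolving ties by 'min' or 'average'."""
    ordered = sorted(scores.values(), reverse=True)
    ranks = {}
    for name, score in scores.items():
        # 1-based positions this score occupies in the sorted list
        positions = [i + 1 for i, s in enumerate(ordered) if s == score]
        if method == "min":
            ranks[name] = min(positions)
        else:
            ranks[name] = sum(positions) / len(positions)
    return ranks


# Illustrative win counts (Brazil's value is an assumption for this sketch)
wins = {"Brazil": 5, "Germany": 4, "Italy": 4}

min_ranks = rank_desc(wins, method="min")
avg_ranks = rank_desc(wins, method="average")
print(min_ranks)  # Germany and Italy share rank 2
print(avg_ranks)  # Germany and Italy share rank 2.5
```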

5. Our venues data

Now let's rank venues by price using the venues DataFrame. We'll see how the rank expression works and how to apply different tie-breaking strategies.

6. Rank venues by price

We start by adding a new column.

7. Rank venues by price

And we use the rank expression on the price column.

8. Rank venues by price

We use descending=False so the lowest price gets rank 1.

9. Rank venues by price

And we name the new column rank_default.

10. Ranked venues

With the default rank expression, equal values get the average rank. Here, the venues on the second and fourth rows have the same low price, so they are both ranked 1.5. But what if we prefer a leaderboard-style ranking where both are ranked first?

11. Rank venues: add min ranking

We do this in a second rank column with method="min". Ties get their minimum rank, so both venues get rank 1. The next rank is 3.

12. Which ranking style should we use?

Both ranking methods are valid. The default average method produces floating-point ranks that better reflect where a venue sits relative to the others, because tied venues share the positions they jointly occupy. Min ranking produces integers and gives a cleaner leaderboard-style display.

13. Correlating user reviews

Now let's move on to correlations - a way to measure similarity between columns. We want to find users with similar tastes so we can make recommendations. In our user reviews DataFrame, each column has a user's review scores. Alice and Charlie both prefer 7burgers and The Queens Head, while Bob prefers Costa Coffee. Correlation lets us quantify these similar and opposing tastes.

14. Calculating correlations

To make the correlations DataFrame, we first select the user_reviews columns we want to correlate.

15. Calculating correlations

We only want the integer columns with review scores, so we use pl.selectors.integer().

16. Calculating correlations

Then we call .corr on the DataFrame. In the output, each cell is the correlation between a pair of users.

17. Adding a user column

To make it clear which pair of users each cell refers to, we add a user column. Each user has a correlation of 1.0 with themselves. Alice and Charlie are highly correlated at 0.95 - if Alice likes a restaurant, we can recommend it to Charlie. Alice and Bob are negatively correlated at −0.91, meaning opposite tastes - we wouldn't pass on their recommendations to each other. A correlation close to 0 means no relationship between preferences.

18. Quick summary with .describe()

The last method we need for quick data analysis is describe(). By default it reports the 25th, 50th, and 75th percentiles. Instead, we want custom output that displays the 33rd and 67th percentiles we used for our categories.

19. Quick summary with .describe()

We do this with the percentiles argument, passing a list with 0.33 and 0.67.

20. Quick summary with .describe()

And this gives us the standard statistics from describe, but now with our preferred percentiles. Custom summaries let us tailor the output to our needs.

21. Let's practice!

Now it's your turn to rank venues and calculate correlations.
