Calculating ranks and correlations
1. Calculating ranks and correlations
In this lesson, we'll rank venues by price, then use correlation to find users with similar tastes.
2. Ranking
Ranking is useful for understanding position within a dataset. Let's say we want to rank these World Cup winners in descending order. Brazil is clearly number 1, but what happens when we have ties like Germany and Italy both at 4 wins?
3. Ranking
One method for tie-breaking is to say they are both number 2.
4. Ranking
Another method is to assign them both 2.5, the average of 2nd and 3rd.
5. Our venues data
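Before we turn to the venues, here is a small plain-Python sketch of the two tie-breaking methods from the World Cup example. The helper `rank_desc` is hypothetical and written only for illustration; Brazil's count of 5 is filled in here, while the 4 wins each for Germany and Italy come from the example.

```python
def rank_desc(values, method="min"):
    """Rank values in descending order, so the largest value gets rank 1."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1         # best position held by this value
        last = first + ordered.count(value) - 1  # worst position held by this value
        # "min" gives every tied value the best position; anything else averages
        ranks[value] = first if method == "min" else (first + last) / 2
    return [ranks[v] for v in values]


# World Cup wins: 4 each for Germany and Italy as in the example, 5 for Brazil
wins = {"Brazil": 5, "Germany": 4, "Italy": 4}

min_ranks = dict(zip(wins, rank_desc(list(wins.values()), method="min")))
avg_ranks = dict(zip(wins, rank_desc(list(wins.values()), method="average")))

print(min_ranks)  # Germany and Italy share rank 2
print(avg_ranks)  # Germany and Italy share rank 2.5
```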
Now let's rank venues by price using the venues DataFrame. We'll see how the rank expression works and how to apply different tie-breaking strategies.
6. Rank venues by price
We start by adding a new column.
7. Rank venues by price
And we use the rank expression on the price column.
8. Rank venues by price
We set descending=False, which is also the default, so the lowest price gets rank 1.
9. Rank venues by price
And we name the new column rank_default.
10. Ranked venues
With the default rank expression, tied values get the average of the ranks they span. Here, the venues on the second and fourth rows have the same low price, so they share ranks 1 and 2 and are both ranked 1.5. But what if we prefer a leaderboard-style ranking where both are ranked first?
11. Rank venues: add min ranking
We do this in a second rank column with method="min". Ties get their minimum rank, so both venues get rank 1. The next rank is 3, because rank 2 is skipped.
12. Which ranking style should we use?
Both ranking methods are valid. The default produces a floating-point value that is a more accurate measure of where a restaurant ranks compared to other venues. Min ranking produces integers and gives a cleaner leaderboard-style display.
13. Correlating user reviews
Now let's move on to correlations, a way to measure similarity between columns. We want to find users with similar tastes so we can make recommendations. In our user reviews DataFrame, each column has a user's review scores. Alice and Charlie both prefer 7burgers and The Queens Head, while Bob prefers Costa Coffee. Correlation lets us quantify these similar and opposing tastes.
14. Calculating correlations
To make the correlations DataFrame, we first select the user_reviews columns we want to correlate.
15. Calculating correlations
We only want the integer columns with review scores, so we use pl.selectors.integer().
16. Calculating correlations
Then we call .corr on the DataFrame. In the output, each cell is the correlation between a pair of users.
17. Adding a user column
To make it clear which pair of users each cell refers to, we add a user column. Each user has a correlation of 1.0 with themselves. Alice and Charlie are highly correlated at 0.95, so if Alice likes a restaurant, we can recommend it to Charlie. Alice and Bob are negatively correlated at −0.91, meaning opposite tastes, so we wouldn't pass on their recommendations to each other. A correlation close to 0 means no relationship between preferences.
18. Quick summary with .describe()
The last method we need for quickly analyzing data is describe(). Instead of the default percentiles, we want custom output that displays the 33rd and 67th percentiles we used for our categories.
19. Quick summary with .describe()
We do this with the percentiles argument, passing a list with 0.33 and 0.67.
20. Quick summary with .describe()
And this gives us the standard statistics from describe, but now with our preferred percentiles. Custom summaries let us tailor the output to our needs.
21. Let's practice!
Now it's your turn to rank venues and calculate correlations.