Exploring grouped data

1. Exploring grouped data

We can get useful insights by grouping our data based on values in one or more of their columns.

2. Filtering for groups

Looking at the chocolates dataset, we might wonder how the location of the company or the cocoa content influences the ratings. We could filter our DataFrame and assign the result to new DataFrames for each company, or each company along with the bean origin country, and then look at the ratings in these DataFrames. However, we would quickly get too many DataFrames that we would have to create manually. So how can we do it better?

3. groupby() to the rescue!

We can use the groupby function! It takes a DataFrame and the column we want to group on, usually a categorical variable, and returns a GroupedDataFrame.

4. GroupedDataFrames

GroupedDataFrame is a bit like a dictionary of DataFrames. It contains all the data from the original DataFrame, but, as the name suggests, grouped by the values from the column used for grouping. We can then access these individual groups by indexing the GroupedDataFrame.
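As a minimal sketch of the indexing described above, the snippet below builds a small stand-in table and pulls out one group by its key. The data, the column names (company_location, cocoa, rating) and the variable names are assumptions made for illustration, not the real chocolates dataset.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

grouped = groupby(chocolates, :company_location)

# Look up a single group by the value of its grouping column.
france = grouped[(company_location = "France",)]

# The same lookup with a plain tuple of key values.
france_again = grouped[("France",)]

println(nrow(france))
```

Each group behaves like a view into the original DataFrame, so the rows keep all their columns.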

5. groupby() in action

Let's see it in practice. To group the chocolates DataFrame by the company location column, we call groupby, passing chocolates, and the column of interest - company location. The result looks like this. We have a GroupedDataFrame object with 60 groups. The first group contains rows where the company location is France and has 156 rows. The last group contains four companies based in Ireland.
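The call described above might look like the following sketch. The table here is a tiny made-up stand-in, so it produces three groups rather than 60; the column names are assumptions.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

# Group the rows by the value in the company_location column.
grouped = groupby(chocolates, :company_location)

# Number of groups, followed by a printout of the groups themselves.
println(length(grouped))
show(grouped)
```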

6. groupby() on multiple columns

We can also group by more than one column. In that case, we pass the columns as a vector.
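A short sketch of grouping on two columns, again on a made-up table with assumed column names. Each distinct pair of values forms its own group.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

# Pass the grouping columns as a vector of column names.
grouped = groupby(chocolates, [:company_location, :cocoa])

# One group per distinct (company_location, cocoa) combination.
println(length(grouped))
```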

7. Order matters

We must be careful though - the order of the columns can matter. Here is a diagram of grouping the chocolates DataFrame first by company location and then by cocoa, compared with grouping first by cocoa and then by company location.
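One way to see the effect of column order is sketched below on a made-up table. The sort = true keyword is an addition for this example so that the group order is deterministic; without it, the order of groups is not guaranteed. Both calls yield the same set of groups, but the key columns and the groups come out in a different order.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

# Same two columns, opposite order; sort = true orders groups by their keys.
by_location_first = groupby(chocolates, [:company_location, :cocoa]; sort = true)
by_cocoa_first = groupby(chocolates, [:cocoa, :company_location]; sort = true)

# Summaries list the grouping columns in the order they were passed.
location_summary = combine(by_location_first, nrow)
cocoa_summary = combine(by_cocoa_first, nrow)

println(location_summary)
println(cocoa_summary)
```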

8. Number of records

To find out the number of records in each group, we can use the combine function on the GroupedDataFrame. The result of the combine function is a new DataFrame containing values for the column we grouped on, in this case, the company location, and a column with the number of rows, which represents the number of companies in each location.
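A minimal sketch of counting rows per group, assuming the row count is requested by passing nrow to combine. The table and column names are made up for illustration.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

grouped = groupby(chocolates, :company_location)

# One row per group: the grouping column plus a column named nrow.
counts = combine(grouped, nrow)

println(counts)
```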

9. Sort by number of records

As we can see, the results are given in the order of the groups, and they are not sorted. To sort them, we can use the sort function, which we know from earlier courses. If the rev keyword is set to true, we sort in descending order.
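The sorting step could be sketched as follows, on the same kind of made-up table. The nrow column name is the default that combine produces when given nrow.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

counts = combine(groupby(chocolates, :company_location), nrow)

# Sort by the count column, largest groups first.
sorted_counts = sort(counts, :nrow, rev = true)

println(sorted_counts)
```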

10. unique() on a vector

And what if we want to know in advance what groups we'll get? We can use the unique function. To use it on a column, we call unique and pass the company location column of the chocolates DataFrame. We get a 60-element vector containing all the unique entries from that column.
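On a made-up table with three locations, the call above might look like this sketch. The column name is an assumption.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
)

# Distinct values of one column, in order of first appearance.
locations = unique(chocolates.company_location)

println(locations)
```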

11. unique() on a DataFrame

However, we can also use it on a DataFrame. In that case, the call of unique on chocolates returns a new DataFrame, containing only those rows from the original DataFrame which are unique. If there are duplicate rows that have the same values for each column in the original DataFrame, then only the first occurrence is kept.
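A small sketch of dropping fully duplicated rows. The table is made up so that its last row repeats its first row exactly.

```julia
using DataFrames

# Toy table (an assumption for illustration); row 4 duplicates row 1.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "France"],
    cocoa = [70, 75, 70, 70],
    rating = [3.5, 3.0, 2.75, 3.5],
)

# Keep only the first occurrence of each distinct row.
deduplicated = unique(chocolates)

println(nrow(chocolates), " rows before, ", nrow(deduplicated), " rows after")
```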

12. unique() with specified columns

Lastly, we can specify a column or columns when we use unique. In that case, the result keeps only the first row for each distinct combination of values in the specified columns. Note that there are about twice as many unique combinations of company and cocoa content as there are unique companies. Overall, the unique function can help us drop duplicate rows that might occur in real datasets.
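To illustrate, here is a sketch of unique restricted to one column and then to two, on a made-up table with assumed column names. The counts it prints belong to the toy data only.

```julia
using DataFrames

# Toy stand-in for the chocolates data; names and values are assumptions.
chocolates = DataFrame(
    company_location = ["France", "France", "Ireland", "U.S.A.", "Ireland", "France"],
    cocoa = [70, 75, 70, 70, 72, 70],
    rating = [3.5, 3.0, 2.75, 3.25, 3.0, 3.5],
)

# First row for each distinct company_location.
by_location = unique(chocolates, :company_location)

# First row for each distinct (company_location, cocoa) pair.
by_location_and_cocoa = unique(chocolates, [:company_location, :cocoa])

println(nrow(by_location))
println(nrow(by_location_and_cocoa))
```

The returned DataFrames still contain every column; only the duplicate check is limited to the columns named in the call.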

13. Let's practice!

Are you ready to group? Let's head to the exercises!
