
How to use the .groupby() method on a DataFrame?

1. How to use the .groupby() method on a DataFrame?

Good work with the .apply() method! Now we'll cover another common method used with DataFrames - .groupby(). Let's get started!

2. Dataset

We'll work with a dataset on the relationship between personal background factors and the concentrations of plasma beta-carotene and plasma retinol in blood. Low concentrations of these compounds have been suggested to be associated with a higher risk of cancer.

3. .groupby()

The .groupby() method groups the data according to some criteria. We can then perform an operation on each group. The most common way is to group the data by a factor specified by a column name. For example, here we split by one factor, gender, and here by two, gender and smoking. In both cases the output is a special DataFrameGroupBy object.
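As a minimal sketch of this step, the snippet below groups a small made-up table by one column and then by two. The column names (gender, smoking, plasma_retinol, bmi) and all values are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "smoking": ["never", "current", "never", "former", "never", "current"],
    "plasma_retinol": [600.0, 500.0, 700.0, 650.0, 550.0, 480.0],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# Group by one factor, then by two factors.
by_gender = df.groupby("gender")
by_gender_smoking = df.groupby(["gender", "smoking"])

print(type(by_gender).__name__)   # DataFrameGroupBy
print(by_gender.ngroups)          # number of distinct genders
print(by_gender_smoking.ngroups)  # number of observed gender / smoking pairs
```

Nothing is computed at this point; the grouped object only records how the rows are split.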

4. Iterating through .groupby() output

It's possible to iterate through this object. Each item is a tuple: the first element is the value of the grouping factor, and the second is the corresponding DataFrame.

5. Iterating through .groupby() output

More grouping factors imply more DataFrames. Here, we get as many DataFrames as there are gender / smoking combinations.
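The iteration described above might look like the following sketch. The toy DataFrame and its column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "smoking": ["never", "current", "never", "former", "never", "current"],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# One grouping factor: the key is a single value such as "F".
for key, group in df.groupby("gender"):
    print(key, group.shape)

# Two grouping factors: the key is a tuple such as ("F", "current").
keys = []
for key, group in df.groupby(["gender", "smoking"]):
    keys.append(key)
    print(key, len(group))
```

Only combinations that actually occur in the data produce a group, so the number of DataFrames can be smaller than the full product of factor levels.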

6. Standard operations on groups

There are many cool things we can do with groups! For example, we already know that DataFrames and Series provide many standard methods to use. We can select a column and apply a method of interest. For example, .mean() or .count(). We can use the same functionality for groups! Here is the mean for each group. And here is the count of valid values for each group. Actually, almost all the DataFrame or Series methods can be applied to DataFrameGroupBy objects.
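A short sketch of these two operations on groups follows. The data is made up, and one value is deliberately missing so that the count of valid values differs from the group size.

```python
import pandas as pd

# Hypothetical toy data with one missing measurement (assumption for illustration).
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "plasma_retinol": [600.0, 500.0, float("nan"), 700.0, 650.0, 480.0],
})

grouped = df.groupby("gender")["plasma_retinol"]

group_means = grouped.mean()    # missing values are skipped
group_counts = grouped.count()  # counts only non-missing values

print(group_means)
print(group_counts)
```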

7. The .agg() method

For example, let's recall the .agg() method. It's almost identical to the .apply() method we talked about before. By default, it applies a function to each specified column and summarizes that column with a single value. For example, to calculate the mean value of the plasma retinol level, we can pass the NumPy mean() function to the method.

8. The .agg() method

Here is the result if we have two columns of interest.

9. The .agg() method

The big difference from the .apply() method is that we can specify several aggregating functions in a list.
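The three .agg() calls discussed so far could be sketched as below. The data and column names are assumptions, and the functions are given by their string names ("mean", "std"), which pandas also accepts, rather than as NumPy functions.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "plasma_retinol": [600.0, 500.0, 700.0, 650.0, 550.0, 480.0],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# One column, one function: a single number.
retinol_mean = df["plasma_retinol"].agg("mean")

# Two columns, one function: one value per column.
column_means = df[["plasma_retinol", "bmi"]].agg("mean")

# Two columns, a list of functions: one row per function.
summary = df[["plasma_retinol", "bmi"]].agg(["mean", "std"])

print(retinol_mean)
print(column_means)
print(summary)
```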

10. .groupby() followed by .agg()

As you might have guessed, the .agg() method can be successfully used for DataFrameGroupBy objects.
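Here is a minimal sketch of .agg() on a grouped object, again with made-up data and assumed column names.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "plasma_retinol": [600.0, 500.0, 700.0, 650.0, 550.0, 480.0],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# Several aggregating functions applied to each selected column within each group.
stats = df.groupby("gender")[["plasma_retinol", "bmi"]].agg(["mean", "max"])

print(stats)
# The columns form two levels: (column name, function name).
print(stats.loc["F", ("plasma_retinol", "mean")])
```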

11. Own functions and lambda expressions

We can, of course, create our own functions. For example, let's count the number of values in a column exceeding the mean value. Here's the corresponding output.
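One way such a function might be written is sketched below. The name n_above_mean is hypothetical, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

def n_above_mean(column):
    """Count the values in a column that exceed the column's mean."""
    return int((column > column.mean()).sum())

# The function receives the bmi values of one group at a time.
above_mean = df.groupby("gender")["bmi"].agg(n_above_mean)
print(above_mean)
```

Because the mean is computed inside the function, each group is compared against its own mean rather than the overall one.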

12. Own functions and lambda expressions

We can also pass lambda expressions to the .agg() method. Here, we simply calculate the size of the column in a group.
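A sketch of the lambda variant, using made-up data and assumed column names:

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# The lambda receives one group's bmi column and returns its length.
group_sizes = df.groupby("gender")["bmi"].agg(lambda column: len(column))
print(group_sizes)
```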

13. Renaming the output

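In older pandas versions, passing a dictionary instead of a list of functions to .agg() on a selected column used the keys as output column names. Recent versions removed this dictionary-based renaming. Named aggregation, where each keyword argument names an output column, serves the same purpose.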
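As a hedged sketch of renaming the output in current pandas, the snippet below uses named aggregation (keyword arguments to .agg()) rather than a dictionary. The output names mean_bmi and max_bmi, like the data, are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

# Each keyword becomes an output column; its value is the aggregating function.
renamed = df.groupby("gender")["bmi"].agg(mean_bmi="mean", max_bmi="max")
print(renamed)
```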

14. The .transform() method

Another useful DataFrame method is .transform(). It's also almost identical to the .apply() method we already discussed. It modifies the values in each given column by some rule specified in a function. For example, let's have a function that centers and scales the data in a column.

15. DataFrame and the .transform() method

Here's the modified DataFrame after applying the .transform() method with our function on two columns.
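A centering and scaling function of this kind could be sketched as follows. The function name center_scale and the toy data are assumptions; the function subtracts the column mean and divides by the sample standard deviation.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "plasma_retinol": [600.0, 500.0, 700.0, 650.0, 550.0, 480.0],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

def center_scale(column):
    """Subtract the column mean and divide by the column standard deviation."""
    return (column - column.mean()) / column.std()

# The result has the same shape as the input; only the values change.
scaled = df[["plasma_retinol", "bmi"]].transform(center_scale)
print(scaled)
```

Unlike .agg(), which reduces each column to one value, .transform() returns one output value per input row.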

16. .groupby() followed by .transform()

If we apply the .transform() method on groups, the output will be different because we modify the columns in each group separately. Afterwards, the transformed data is merged into a single DataFrame.

17. .groupby() followed by .transform()

Of course, instead of a well-defined function, we can also use a lambda expression.
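The group-wise version can be sketched with both a named function and an equivalent lambda. The data and names are again illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "plasma_retinol": [600.0, 500.0, 700.0, 650.0, 550.0, 480.0],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

def center_scale(column):
    """Subtract the mean and divide by the standard deviation."""
    return (column - column.mean()) / column.std()

grouped = df.groupby("gender")[["plasma_retinol", "bmi"]]

# Each group is centered and scaled with its own mean and standard deviation.
scaled_named = grouped.transform(center_scale)

# The same transformation written as a lambda expression.
scaled_lambda = grouped.transform(lambda x: (x - x.mean()) / x.std())

# Rows come back in the original order, so the result aligns with df.
print(scaled_named.assign(gender=df["gender"]))
```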

18. The .filter() method of DataFrameGroupBy object

The last method we discuss is the .filter() method. It filters out groups according to the logical output of the passed function and merges the remaining groups into a new DataFrame. Notice that the function acts on the whole DataFrame of each group. Therefore, we can specify quite complex filters.

19. .groupby() followed by .filter()

For example, when we group here, we get 6 groups. Let's have a function that checks if the mean BMI value is higher than 26. When we use it inside .filter(), we get the filtered DataFrame.

20. .groupby() followed by .filter()

We can check how many groups were filtered out by grouping the filtered data again. Now we have only 3 groups instead of 6.
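The filtering steps above can be sketched on made-up data. The group counts here (5 before, 3 after) come from this toy table and differ from the 6 and 3 of the real dataset; the function name high_mean_bmi is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "smoking": ["never", "current", "never", "former", "never", "current"],
    "bmi": [22.0, 27.0, 30.0, 25.0, 24.0, 28.0],
})

def high_mean_bmi(group):
    """Return True when the group's mean BMI exceeds 26."""
    return group["bmi"].mean() > 26

keys = ["gender", "smoking"]
n_groups_before = df.groupby(keys).ngroups

# Keep only the rows that belong to groups passing the check.
filtered = df.groupby(keys).filter(high_mean_bmi)

# Group the filtered data again to see how many groups remain.
n_groups_after = filtered.groupby(keys).ngroups

print(filtered)
print(n_groups_before, n_groups_after)
```

The function sees the full DataFrame of each group, so a filter can combine several columns if needed.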

21. Let's practice!

That's it on the .groupby() method. Let's have some practice!
