
Data filtration using the filter() function

1. Data filtration using the filter() function

In this last lesson, we will discuss how to use the filter() function on a pandas groupby object. It lets us keep only the groups that satisfy a specific condition.

2. Purpose of filter()

Often, after grouping the entries of a DataFrame according to a specific feature, we want to keep only a subset of those groups, based on some condition. Some examples of filtration conditions are the number of missing values in a group, the mean of a specific feature, or the number of rows that belong to the group.
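As a rough sketch of those three kinds of conditions, the snippet below applies each one to a small made-up DataFrame. The data, the column names, and the thresholds are all assumptions chosen for illustration; they do not come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per meal, with one missing bill on Thursday
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [12.0, np.nan, 15.0, 22.0, 30.0, 26.0],
})
grouped = df.groupby("day")

# Condition on missing values: keep days with no missing total_bill
no_missing = grouped.filter(lambda g: g["total_bill"].isna().sum() == 0)

# Condition on a mean: keep days whose mean total_bill exceeds 20
high_mean = grouped.filter(lambda g: g["total_bill"].mean() > 20)

# Condition on group size: keep days that appear at least twice
frequent = grouped.filter(lambda g: len(g) >= 2)
```

In each case the result is an ordinary DataFrame containing every row of the groups that passed the test, not one row per group.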

3. Filter using groupby().filter()

We are interested in finding the mean amount of tips given on the days when the mean total bill is more than 20 USD. The .filter() function accepts a function, here a lambda, that receives each group as a DataFrame and returns True or False. In this example, the lambda function selects "total_bill" and checks whether its mean() is greater than 20. Groups for which the lambda returns True are kept with all of their rows, and the remaining groups are dropped. The mean() of the tip is then calculated on the rows that are left. If we compare it to the overall mean of the tips, we can see that the two values differ, which shows that the filter removed some of the days.
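A minimal sketch of this step is shown below. It uses a small hand-made DataFrame as a stand-in for the restaurant tips data, so the values and the variable names (tips, filtered, mean_tip_filtered, mean_tip_all) are assumptions rather than the lesson's own code.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two meals per day
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [12.0, 18.0, 15.0, 17.0, 22.0, 30.0, 25.0, 21.0],
    "tip": [2.0, 3.0, 2.5, 3.0, 4.0, 5.0, 4.5, 3.5],
})

# Keep every row of the days whose mean total_bill is above 20
filtered = tips.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)

# Mean tip on the remaining days versus the mean tip over all days
mean_tip_filtered = filtered["tip"].mean()
mean_tip_all = tips["tip"].mean()

print(mean_tip_filtered, mean_tip_all)
```

With this sample data only Saturday and Sunday pass the condition, so the filtered mean is computed over four rows instead of eight.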

4. Comparison with native methods

If we attempt to perform this operation without using groupby(), we end up with this inefficient code. First, we use a list comprehension to find the days that have a mean meal price greater than $20. Then we use a for loop to append the tips from those days to a list, and finally we calculate the mean of that list. It might seem intuitive, but as we can see, it is also far less efficient than the groupby() version.
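The manual approach described here might look roughly like the following. This is a reconstruction under assumptions, not the code from the slide: it reuses a hypothetical sample DataFrame, and the names high_days, tip_values, and manual_mean are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two meals per day
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [12.0, 18.0, 15.0, 17.0, 22.0, 30.0, 25.0, 21.0],
    "tip": [2.0, 3.0, 2.5, 3.0, 4.0, 5.0, 4.5, 3.5],
})

# List comprehension: days whose mean total_bill is above 20.
# Each day triggers a separate scan of the whole DataFrame.
high_days = [
    day for day in tips["day"].unique()
    if tips.loc[tips["day"] == day, "total_bill"].mean() > 20
]

# For loop: collect the tips of those days one value at a time
tip_values = []
for day in high_days:
    for tip in tips.loc[tips["day"] == day, "tip"]:
        tip_values.append(tip)

manual_mean = sum(tip_values) / len(tip_values)
print(manual_mean)
```

The repeated boolean masks and the element-by-element appends are what make this version slower than a single groupby().filter() call, which arrives at the same mean.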

5. Let's do it!

Now it's your turn to filter!