Column filtering and aggregation
1. Column filtering and aggregation
How can we combine filtering and aggregation to clean and analyze related columns?2. Book sales dataset
We'll continue to use our dataset of bestselling books.3. Defining conditions for filtering
Data analysis and cleaning often requires filtering based on multiple conditions. Using Tablesaw's `Selection` class, we create filter conditions: `.isGreaterThan()` checks if sales exceed 70 million, `.isEqualTo()` finds English books, and another `.isGreaterThan()` identifies books published after 1950. Similar methods exist for `.isLessThan()`, `.isLessThanOrEqualTo()`, and `.isGreaterThanOrEqualTo()`. Next we'll see how to apply these conditions.4. Filtering based on conditions
Tablesaw provides three ways to filter data. The simplest is using `.where()` with a direct condition like `.isGreaterThan(70)`. For reusable conditions, we can create `Selection` variables first, then pass them to `.where()`. Most powerfully, we can combine multiple conditions using `.and()` - here finding modern, high-selling English books. `.sortDescendingOn()` sorts the results in descending order by "Sales_in_Millions".5. Filtering based on conditions: outputs
The output shows three books meeting all our criteria.6. Summarizing by mean
To find overall statistics, the `.summarize()` method calculates a single statistic across all rows. Here we find the mean sales across all books. First we specify the column ("Sales_in_Millions") and statistic (`mean`), then `.apply()` performs the calculation, showing average sales of about 73 million copies.7. Aggregating by groups
Aggregation helps us understand patterns across groups. After applying `.summarize()`, `.by()` organizes these calculations by category. Think of `.by()` as creating separate groups - in this case, one group for each language - and then calculating the mean sales for each group separately. It's like sorting books into language piles and finding the average sales for each pile.8. Aggregating by groups: outputs
Here we calculate average book sales for each language, revealing interesting market differences.9. Combining filtering and aggregation
Often we need to filter data before aggregating. Here we combine `.where()` to select modern books (after 1950) with `.summarize()` to calculate total sales by language and `.by()` to group results by language. The operations are chained: first filter, then calculate the sum, then group by language.10. Combining filtering and aggregation: outputs
The output shows average book sales for each language, after filtering for modern books published after 1950.11. Putting it all together
Let's review filtering and aggregation operations for analysis and cleaning. We start with filtering using `.where()` and conditions like `.isGreaterThan()`. Multiple conditions can be combined with `.and()`. For analysis, `.summarize()` calculates statistics while `.by()` groups data. These operations can be chained - first filtering data, then calculating aggregations by groups. This combination of filtering and aggregation enables powerful data analysis.12. Let's practice!
Now it's your turn to practice filtering and aggregating columns!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.