1. Grouping data by category in pandas
When using categorical variables in pandas, there are a few essential methods used for exploring data that every pandas user must master. We will cover one of these methods now.
2. The basics of .groupby(): splitting data
One of the most common forms of data analysis is to split data into groups and then analyze each group separately. The groupby method handles the splitting: it divides the data by the unique values of the column specified. The process is essentially equivalent to creating multiple DataFrames, one for each value of that column.
In the adult income dataset, the Above/Below 50k variable has two categories. This data can be split into two DataFrames using two separate filters.
However, the same split can be done in a one-liner using groupby.
The first parameter of the groupby method is by, which specifies the variable or variables to split the data on. A single column name does not have to be wrapped in a list, but we will use a list throughout for consistency.
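As a rough illustration of the two approaches, here is a minimal sketch. The tiny DataFrame, its made-up values, and the column labels "Above/Below 50k" and "Age" are assumptions standing in for the adult income dataset.

```python
import pandas as pd

# Tiny stand-in for the adult income dataset; all values are made up.
adult = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"],
    "Age": [25, 47, 38, 52, 29],
})

# Manual approach: one filter per category.
below = adult[adult["Above/Below 50k"] == "<=50K"]
above = adult[adult["Above/Below 50k"] == ">50K"]

# groupby approach: a single call splits on every unique value.
groupby_object = adult.groupby(by=["Above/Below 50k"])

# Both approaches produce the same split: one group per category,
# each holding the same number of rows as the matching filter.
group_sizes = groupby_object.size()
print(groupby_object.ngroups)
print(group_sizes)
```

The filter approach needs one line per category, while the groupby call stays the same no matter how many categories the column has.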
3. The basics of .groupby(): apply a function
After calling groupby, you can specify a function that should be applied to the split data. Common functions such as sum, count, and mean can be used, but custom functions can be applied as well.
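A minimal sketch of both kinds of function follows. The data is made up, and value_range is a hypothetical custom function written only for this example.

```python
import pandas as pd

# Made-up data; the column labels are assumptions about the adult dataset.
adult = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K"],
    "Age": [20, 50, 40, 60],
})

def value_range(column):
    """Hypothetical custom aggregation: spread between max and min."""
    return column.max() - column.min()

grouped = adult.groupby(by=["Above/Below 50k"])

# Built-in function: number of non-missing values per group.
row_counts = grouped.count()

# Custom function: passed to agg, which applies it to each column of each group.
age_ranges = grouped.agg(value_range)

print(row_counts)
print(age_ranges)
```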
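When using numerical functions, such as mean, the function should only be applied to the numerical columns. Older versions of pandas dropped the non-numerical columns automatically; in pandas 2.0 and later, pass numeric_only=True, otherwise mean raises an error when it reaches a text column.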
Notice that the mean of each numerical column in the adult dataset was calculated across the two groups of the Above/Below 50k column.
Just as a quick note, you don't have to store the groupby object in a variable before running a function. You can chain the groupby call and the function call in a one-liner.
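The two-step and chained forms might look like the sketch below. The data is made up and the column labels are assumptions; numeric_only=True is passed explicitly so that the text column "Marital Status" is skipped regardless of the pandas version.

```python
import pandas as pd

# Made-up data; the column labels are assumptions about the adult dataset.
adult = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K"],
    "Marital Status": ["Never-married", "Divorced", "Divorced", "Never-married"],
    "Age": [20, 50, 40, 60],
    "Education Num": [9, 13, 11, 15],
})

# Two steps: create the groupby object, then apply the function.
groupby_object = adult.groupby(by=["Above/Below 50k"])
two_step = groupby_object.mean(numeric_only=True)

# One-liner: chain the groupby call and the function call.
chained = adult.groupby(by=["Above/Below 50k"]).mean(numeric_only=True)

print(chained)
```

Both forms return the same DataFrame, with one row per category and one column per numerical variable.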
4. Specifying columns
When using large datasets, it is not always practical to run a groupby calculation on every column. It is important to select the columns of interest before calling the function to apply.
Consider this groupby call, which we will call option 1. We group by the Above/Below 50k column, subset to just the age and education number columns, and then calculate the sum of these two columns.
Alternatively, in option 2, we get the same result if we group by Above/Below 50k, call the sum function, and then subset the results to just the age and education number columns.
Option 1 tends to be much faster, as it does not perform the sum calculation on the other columns. As your datasets get larger, it is important to be careful about how you run the groupby method.
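The two orderings can be sketched as follows. The data is made up, and the column labels, including the extra "Hours/Week" column, are assumptions used only to show a column that the first form never has to sum.

```python
import pandas as pd

# Made-up data; the column labels are assumptions about the adult dataset.
adult = pd.DataFrame({
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", ">50K"],
    "Age": [20, 50, 40, 60],
    "Education Num": [9, 13, 11, 15],
    "Hours/Week": [40, 45, 38, 50],
})

columns_of_interest = ["Age", "Education Num"]

# Subset first, then sum: only the two selected columns are aggregated.
option_1 = adult.groupby(by=["Above/Below 50k"])[columns_of_interest].sum()

# Sum first, then subset: every numerical column is aggregated,
# and the unwanted results are thrown away afterwards.
option_2 = adult.groupby(by=["Above/Below 50k"]).sum(numeric_only=True)[columns_of_interest]

print(option_1)
```

On four rows the difference is invisible, but the wasted work in the second form grows with the number of rows and columns.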
5. Groupby multiple columns
The groupby method can be called on more than one variable.
By specifying two columns with the by parameter, the groupby method will create subsets of the data for all combinations of the variables specified. Here we are using the size function to check how many rows of the data fall into each grouping.
The variable Above/Below 50k has two categories, while Marital Status has seven. This creates 14 different groupings. When calling groupby on multiple variables, it is important to check the size of each combination to make sure there are enough rows per combination to do analysis. For example, the combination of more than 50k and Married-AF-spouse only has 10 rows of data, which is not very much. If no rows exist for a combination of the variables, it will not be added to the groupby object.
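A small sketch of this size check follows. The data is made up, the column labels are assumptions, and both columns hold plain strings rather than the pandas categorical dtype.

```python
import pandas as pd

# Made-up data with two income categories and two marital statuses.
# No row combines ">50K" with "Never-married".
adult = pd.DataFrame({
    "Above/Below 50k": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "Marital Status": ["Never-married", "Divorced", "Divorced",
                       "Divorced", "Never-married"],
})

# size counts the rows in each combination of the two variables.
group_sizes = adult.groupby(by=["Above/Below 50k", "Marital Status"]).size()

print(group_sizes)
```

Two variables with two values each allow four combinations, but only three appear in the result, because the combination with no rows is left out.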
6. Practice using .groupby()
Let's work through a few examples of using the pandas groupby method.