1. Does gender affect whose vehicle is searched?
During a traffic stop, the police officer sometimes conducts a search of the vehicle. Does the driver's gender affect whether their vehicle is searched? Let's review a few pandas techniques that will help us to answer this question.
2. Math with Boolean values
Recall that you can perform mathematical operations on Boolean values. For example, you previously used the isnull() method to generate a DataFrame of True and False values, and then took the sum() to count the missing values in each column. This worked because True values were treated as ones and False values were treated as zeros.
Now we'll use the NumPy library to demonstrate a different operation, namely the mean. If you take the mean() of the list 0 1 0 0 you'll get 0.25, calculated as 1 divided by 4. Similarly, if you take the mean() of the list False True False False, you'll also get 0.25. Thus, the mean of a Boolean Series represents the percentage of values that are True.
3. Taking the mean of a Boolean Series
Now, let's see a real example of why it's useful to be able to take the mean of a Boolean Series.
We'll first calculate the percentage of stops that result in an arrest using the value_counts() method. The arrest rate is around 3.6% since that's the percentage of True values. Note that this would work on an object column or a Boolean column.
But we can get the same result more easily by taking the mean() of the is_arrested Series. This method only works because the data type is Boolean. This is exactly why you changed the data type of this Series from object to Boolean back in the first chapter.
4. Comparing groups using groupby (1)
The second technique we'll review is groupby(), which you've used in previous courses.
Let's pretend that you wanted to study the arrest rate by police district. You can see that there are six districts by using the Series method unique().
One approach we've used to compare groups is to filter the DataFrame by each group, and then perform a calculation on each subset. So to calculate the arrest rate in Zone K1, we would filter by that district, select the is_arrested column, and then take the mean(). The arrest rate is about 2.4%, which is lower than the overall arrest rate of 3.6%.
5. Comparing groups using groupby (2)
Next we calculate the arrest rate in Zone K2, which is about 3.1%. But rather than repeating this process for all six districts, we can instead group by the district column, which will perform the same calculation for all districts at once. You can see a noticeably higher arrest rate in Zone X4.
6. Grouping by multiple categories
You can also group by multiple categories at once. For example, you can group by district and gender by passing it as a list of strings. This computes the arrest rate for every combination of district and gender. In other words, you can see the arrest rate for males and females in each district separately.
Note that if you reverse the ordering of the items in the list, grouping first by gender and then by district, the calculations will be the same but the presentation of the results will be different. You can use whichever option makes it easier for you to understand the results.
7. Let's practice!
Now it's time to practice using these techniques to investigate the relationship between driver gender and vehicle searches.