Two-sample proportion tests

1. Two-sample proportion tests

Great work so far! In the previous lesson, we tested a single proportion against a specific value. As with means, we can also test for differences between proportions in two populations.

2. Comparing two proportions

The Stack Overflow survey contains a hobbyist variable. The value "Yes" means the user described themselves as a hobbyist and "No" means they described themselves as a professional. We can hypothesize that the proportion of hobbyist users is the same for the under thirty age category as the thirty or over category, which is a two-tailed test. More formally, the null hypothesis is that the difference between the population parameters for each group is zero. Let's set a significance level of point-zero-five.
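Written out, the hypotheses described above might look like the following. The subscripts are notation introduced here for the two age groups, not taken from the slides.

```latex
H_0:\; p_{\geq 30} - p_{<30} = 0
\qquad
H_A:\; p_{\geq 30} - p_{<30} \neq 0
\qquad
\alpha = 0.05
```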

3. Calculating the z-score

Here is the z-score equation for a proportion test. Let's break it down. The sample statistic is the difference in the proportions for each category. That's the two p-hat values in the numerator. We subtract the hypothesized value of the population parameter, and assuming the null hypothesis is true, it's zero. The denominator is the standard error of the sample statistic. We can again avoid having to generate a bootstrap distribution to calculate the standard error by using a standard error equation, which is a slightly more complicated version of the one-sample case. Note that p-hat is a weighted mean of the sample proportions for each category, also known as a pooled estimate of the population proportion. p-hat can be calculated using the following equation. This looks horrendous, but Python is great at handling arithmetic. We now only need four numbers from the sample dataset to perform these calculations and calculate the z-score: the proportion of hobbyists in each age group, and the number of observations in each age group.
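The equations referred to in this step are the standard pooled two-sample forms. A reconstruction consistent with the description, using subscripts chosen here for the two age groups, would be:

```latex
z = \frac{(\hat{p}_{\geq 30} - \hat{p}_{<30}) - 0}{\mathrm{SE}(\hat{p}_{\geq 30} - \hat{p}_{<30})}

\mathrm{SE}(\hat{p}_{\geq 30} - \hat{p}_{<30})
  = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n_{\geq 30}} + \frac{\hat{p}\,(1 - \hat{p})}{n_{<30}}}

\hat{p} = \frac{n_{\geq 30}\,\hat{p}_{\geq 30} + n_{<30}\,\hat{p}_{<30}}{n_{\geq 30} + n_{<30}}
```

The same pooled p-hat appears in both terms under the square root, because under the null hypothesis the two groups share a single population proportion.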

4. Getting the numbers for the z-score

To calculate these four numbers, we group by the age category, and calculate the sample proportions using dot-value_counts with normalize set to True, and the row counts using dot-count. As we're looking at the proportion of hobbyists, we'll only be focusing on rows where hobbyist is Yes.
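A minimal sketch of this grouping step is below. The DataFrame is a tiny made-up stand-in for the survey, and the names stack_overflow, age_cat, hobbyist, and the category labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-in for the survey data (not the real dataset)
stack_overflow = pd.DataFrame({
    "age_cat": ["At least 30"] * 4 + ["Under 30"] * 5,
    "hobbyist": ["Yes", "Yes", "Yes", "No",
                 "Yes", "Yes", "Yes", "Yes", "No"],
})

# Proportion of each hobbyist answer within each age category
p_hats = stack_overflow.groupby("age_cat")["hobbyist"].value_counts(normalize=True)

# Number of rows in each age category
n = stack_overflow.groupby("age_cat")["hobbyist"].count()

print(p_hats)
print(n)
```

Without normalize=True, value_counts would return raw counts rather than proportions.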

5. Getting the numbers for the z-score

To isolate the hobbyist proportions from p_hats, we can use pandas' MultiIndex subsetting, passing a tuple of the outer index level and inner index level values. This returns a sample proportion of point-77 for the at least thirty group, and point-84 for the under thirty group.
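The tuple-based subsetting could look like the sketch below. The Series is built by hand from the rounded proportions quoted above, with the "No" values taken as their complements. The level names and category labels are assumptions.

```python
import pandas as pd

# Hand-built stand-in for p_hats, using the rounded values from the text;
# the labels "At least 30" / "Under 30" are assumed for illustration
index = pd.MultiIndex.from_tuples(
    [("At least 30", "No"), ("At least 30", "Yes"),
     ("Under 30", "No"), ("Under 30", "Yes")],
    names=["age_cat", "hobbyist"],
)
p_hats = pd.Series([0.23, 0.77, 0.16, 0.84], index=index)

# A tuple of (outer level, inner level) selects a single value
p_hat_at_least_30 = p_hats[("At least 30", "Yes")]
p_hat_under_30 = p_hats[("Under 30", "Yes")]

print(p_hat_at_least_30, p_hat_under_30)
```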

6. Getting the numbers for the z-score

The number of observations in each age category can be extracted with simpler pandas subsetting. There are 1050 rows in the at least thirty group and 1211 in the under thirty group.

7. Getting the numbers for the z-score

After that, we can do the arithmetic using our equations for p_hat, the standard error, and the z-score to get the test statistic. This returns a z-score of minus four-point-two-two. Luckily, we can avoid much of this arithmetic.
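As a sketch of that arithmetic, the snippet below plugs in the four numbers quoted earlier. Because the proportions here are the rounded values (point-77 and point-84), the result lands close to, but not exactly on, the z-score quoted above. Variable names are illustrative.

```python
import math

# The four inputs, with proportions rounded as quoted in the text
p_hat_at_least_30 = 0.77
p_hat_under_30 = 0.84
n_at_least_30 = 1050
n_under_30 = 1211

# Pooled estimate: weighted mean of the two sample proportions
p_hat = (
    n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30
) / (n_at_least_30 + n_under_30)

# Standard error of the difference in proportions under the null hypothesis
std_error = math.sqrt(
    p_hat * (1 - p_hat) / n_at_least_30
    + p_hat * (1 - p_hat) / n_under_30
)

# z-score: (sample statistic - hypothesized difference of zero) / standard error
z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error
print(z_score)
```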

8. Proportion tests using proportions_ztest()

The proportions_ztest function from statsmodels can calculate the z-score more directly. This function requires two objects as NumPy arrays: the number of hobbyists in each age group, and the total number of rows in each age group. We can get these numbers by grouping by age_cat, and calling dot-value_counts on the hobbyist column. The numbers can then either be read off or subsetted to create the arrays. Next, we import proportions_ztest from statsmodels-dot-stats-dot-proportion, and pass the arrays to the count and nobs arguments. Because we're testing for a difference, we specify that this is a two-sided test using the alternative argument. proportions_ztest returns a z-score and a p-value. The p-value is smaller than the five percent significance level we specified, so we can conclude that there is a difference in the proportion of hobbyists between the two age groups.
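A sketch of the call is below. The row totals are the ones quoted earlier, but the hobbyist counts (812 and 1021) are assumed values chosen to be consistent with the rounded proportions in the text, not numbers stated in this transcript.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Assumed hobbyist counts, consistent with proportions of about 0.77 and 0.84
n_hobbyists = np.array([812, 1021])

# Total rows per age group, as quoted in the text
n_rows = np.array([1050, 1211])

# Two-sided test of whether the two proportions differ
z_score, p_value = proportions_ztest(
    count=n_hobbyists, nobs=n_rows, alternative="two-sided"
)
print(z_score, p_value)
```

The first element of each array corresponds to the at least thirty group, so the sign of the z-score reflects that group's proportion minus the under thirty group's.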

9. Let's practice!

Time to perform your own proportion tests.