
Stratified Random Sampling

1. Stratified Random Sampling

Welcome back!

2. What is stratified random sampling

Sometimes a sample drawn by simple random sampling doesn't match what we know about the wider population. If the real-world population is 50% male and 50% female but our sample is 60% male and 40% female, the sample misrepresents the population, and any analysis based on it would be biased. Stratified sampling is meant to better reflect the population. It is a technique in which a population is divided into discrete groups called strata, based on shared attributes. Each stratum is then sampled separately so that the proportions in the sample match the proportions in the population.

3. Why use stratified random sampling?

This method helps minimize selection bias and ensures that the true underlying population structure is represented. It is also efficient: a population may be too large for every member to be analyzed individually, so members are organized into groups with shared features to save cost and time. Stratified random sampling can be used when estimating income across varying populations, polling for elections, or estimating life expectancy.

4. When not to use stratified random sampling

Although this method works for populations that can be stratified using relevant attributes, we have to ensure that the groups do not overlap. Subjects that fall into multiple groups have a higher likelihood of being chosen and may cause misrepresentation in the sample. For example, if a survey question asked respondents, "How long have you worked at your current job?", and the choices included "one to two years" and "two to four years", the choices overlap: those who have been at their job exactly two years could pick either option. We would have to make sure that the survey choices are mutually exclusive before performing stratified random sampling.
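One way to keep tenure groups from overlapping is to bin the raw values with half-open intervals. The sketch below is illustrative only: the tenure values and labels are made up, and it assumes the survey recorded tenure as a number of years rather than a multiple-choice answer.

```python
import pandas as pd

# Hypothetical tenure values in years (not from the firm ABC survey)
tenure_years = pd.Series([1.0, 2.0, 3.5])

# right=False makes each bin include its left edge and exclude its right edge,
# so the bins are [1, 2) and [2, 4): exactly two years falls in one group only.
tenure_group = pd.cut(
    tenure_years,
    bins=[1, 2, 4],
    right=False,
    labels=["1 to <2 years", "2 to <4 years"],
)
print(tenure_group.tolist())
```

With these bins, an employee with exactly two years of tenure lands only in the second group.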

5. Onsite work survey results at firm ABC

Let's look at the results of a survey sent out to employees of firm ABC. We have selected a random sample of 100 out of 1000 employees who showed interest in on-site work. Suppose we want the sample of employees interested in on-site work to reflect the gender distribution of the firm's population.

6. Check proportions on population

To assess how gender is distributed at firm ABC, we use the pandas method value_counts on the gender column, and to convert the counts to proportions, we pass normalize=True. From the results, we expect a sample from this dataset to have a similar distribution of approximately 55.6% female and 44.4% male.
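As a minimal sketch of this check, the snippet below builds a stand-in dataset, since the actual survey file is not shown here. The DataFrame name survey_population and the "Female"/"Male" labels are assumptions; the counts are chosen to reproduce the proportions quoted above.

```python
import pandas as pd

# Hypothetical stand-in for the firm's data: 1000 employees,
# 556 female and 444 male (assumed counts matching 55.6% / 44.4%).
survey_population = pd.DataFrame(
    {"gender": ["Female"] * 556 + ["Male"] * 444}
)

# normalize=True returns proportions instead of raw counts
population_props = survey_population["gender"].value_counts(normalize=True)
print(population_props)
```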

7. Plotting proportions on population

Let's create a pie chart visualizing the gender ratio. First, we'll import the matplotlib.pyplot library, then select the gender column of the survey population, call the value_counts method, and follow it with .plot.pie. Our population has slightly more females than males in the company.
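The plotting step might look like the sketch below. It again uses a made-up stand-in for the survey data (the name survey_population and the counts are assumptions), so only the chain of calls mirrors the description.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the firm's data (assumed counts)
survey_population = pd.DataFrame(
    {"gender": ["Female"] * 556 + ["Male"] * 444}
)

# Count each gender, then draw the counts as a pie chart;
# Series.plot.pie returns the matplotlib Axes it drew on.
ax = survey_population["gender"].value_counts().plot.pie()
plt.show()
```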

8. Stratified sampling example

We will implement stratified sampling using the pandas groupby and apply methods. We first use groupby to split the dataset into two groups, male and female. Within groupby, we set group_keys=False so that the group labels are not added as an extra index level. Now, because we want 100 employees out of the 1000 employees, we want to apply .sample with frac=0.1 as before. However, instead of sampling from the whole dataset, we apply .sample to each group separately using a lambda function. pandas will then automatically combine the rows selected from each group to return the final sample. Our sample now includes 100 employees with gender proportions that resemble the firm's population.
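A sketch of the groupby and apply approach described above, run on a made-up stand-in dataset. The names survey_population, sampled, and stratified_sample, the counts, and the random_state value are all assumptions. Depending on the installed pandas version, apply may warn about, or leave out, the grouping column, so the final rows are looked up by index to keep the gender column either way.

```python
import pandas as pd

# Hypothetical stand-in for the firm's data (assumed counts)
survey_population = pd.DataFrame(
    {"gender": ["Female"] * 556 + ["Male"] * 444}
)

# Sample 10% of each gender group separately; group_keys=False keeps
# the original row labels instead of adding the group label to the index.
sampled = (
    survey_population
    .groupby("gender", group_keys=False)
    .apply(lambda group: group.sample(frac=0.1, random_state=42))
)

# Look the sampled rows up by index so that the gender column is present
# regardless of how this pandas version treats grouping columns in apply.
stratified_sample = survey_population.loc[sampled.index]

print(len(stratified_sample))
print(stratified_sample["gender"].value_counts())
```

Because each group is sampled at the same fraction, the sample holds 56 female and 44 male employees, 100 rows in total.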

9. Check proportions on sample

To check if this is correct, we run the value_counts method on the sample, and indeed, our gender proportion closely matches our original population.
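The comparison can be sketched as follows on a made-up stand-in dataset (survey_population, the counts, and random_state are assumptions). To keep the example short, it draws the stratified sample with the built-in groupby(...).sample shortcut, which is a swapped-in alternative to the groupby and apply approach and assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical stand-in for the firm's data (assumed counts)
survey_population = pd.DataFrame(
    {"gender": ["Female"] * 556 + ["Male"] * 444}
)

# Draw 10% from each gender group (shortcut for the groupby/apply pattern)
stratified_sample = survey_population.groupby("gender").sample(
    frac=0.1, random_state=42
)

# Put population and sample proportions side by side
population_props = survey_population["gender"].value_counts(normalize=True)
sample_props = stratified_sample["gender"].value_counts(normalize=True)
comparison = pd.DataFrame(
    {"population": population_props, "sample": sample_props}
)
print(comparison)
```

The sample proportions differ from the population by less than half a percentage point, which comes only from rounding each group's size to a whole number of employees.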

10. Let's practice!

We've gained some skills. Now let's implement them!