
Cluster Sampling

1. Cluster Sampling

We are now going to explore how to perform cluster sampling on a survey DataFrame in Python.

2. What is cluster sampling?

In cluster sampling, the entire population is divided into several subgroups, called clusters, and each cluster is assumed to have characteristics similar to the population as a whole. In survey analysis, we divide the population into clusters and then randomly select some of those clusters to form the sample used in the next step. It is important to note that cluster sampling does not sample individuals directly; it randomly selects whole subgroups.

3. Why cluster sampling is important

Cluster sampling is important because we cannot always gather data from the entire population. When the population is very large, the likelihood that a random sample's statistics exactly match the population's parameters is low. Cluster sampling offers a practical way to sample from very large populations while keeping that error manageable, provided the clusters resemble the population.

4. Steps in cluster sampling analysis

The steps for performing a cluster sampling analysis are as follows. First, we divide the population into clusters. This is our sampling frame. We can choose natural groupings, such as geographical region, for the sampling frame. Second, we pick a random sample of these clusters to form our representative sample, assuming each cluster is a mini-representation of the population as a whole. Let's do this with an example.

5. Sample dataset

Let's look at a survey that aims to measure attitudes towards mental health in the tech workplace. The survey asks respondents their gender, the country where they work, and whether or not they seek professional treatment for their mental health.

6. Sample dataset and plot

First, we group the population into clusters by work location using the groupby method, and then count the number of respondents in each group by calling count on the gender column. To plot a bar graph of our results, we first need to reset the index and rename the columns of our DataFrame, then call .plot.bar(), passing country_live as x and the number of respondents in each country as y. It is evident that the majority of our respondents work in the US.
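The grouping and counting step can be sketched as follows. The survey data itself is not shown in this transcript, so the small DataFrame below is made up for illustration; the column names gender, country_live and sought_treatment follow the narration, while the values and the respondents column name are assumptions.

```python
import pandas as pd

# Hypothetical toy survey; the values are invented for illustration only.
survey = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "F", "M", "F"],
    "country_live": ["US", "US", "US", "UK", "UK", "Canada", "US", "Germany"],
    "sought_treatment": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Group by country, count the non-null gender responses in each group,
# then turn the result back into a regular two-column DataFrame.
counts = (
    survey.groupby("country_live")["gender"]
    .count()
    .reset_index()
    .rename(columns={"gender": "respondents"})  # "respondents" is an assumed name
)

print(counts)
```

With matplotlib installed, calling counts.plot.bar(x="country_live", y="respondents") on this result would draw the bar chart described above.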

7. Choose clusters

From our clusters of countries, we can then randomly choose some clusters to represent the population and base our analysis on. To randomly choose 10 clusters, for example, we first create a list of the unique countries present in the survey by wrapping the set function inside the list function, then use the random.choice function from the numpy library to pick ten countries from the list. By setting the replace parameter to False, we ensure that no country is selected more than once. We'll call this random selection random_clusters.
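A minimal sketch of this selection step, assuming a toy country_live column with only six distinct countries, so it draws three clusters instead of ten. The data and the sample size are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey column; values are invented for illustration.
survey = pd.DataFrame({
    "country_live": ["US", "US", "UK", "Canada", "Germany", "India",
                     "US", "Brazil", "UK", "India"],
})

# Unique countries, built with set() inside list() as described above.
countries = list(set(survey["country_live"]))

# Draw three distinct countries; replace=False prevents repeats.
random_clusters = np.random.choice(countries, size=3, replace=False)

print(random_clusters)
```

The selection changes from run to run. Seeding numpy alone would not fully fix it, because the order of a Python set of strings is not guaranteed to be stable between sessions; sorting the list first is one way to make a seeded draw repeatable.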

8. Create cluster sample

Then, to select respondents only from our random list of countries, we subset the survey so that it only includes rows whose country_live value is in random_clusters. The isin method will help us do this.
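The subsetting step might look like the sketch below. The survey values are made up, and random_clusters is hard-coded here (rather than drawn at random) purely so the result is easy to follow; the name cluster_sample is an assumption.

```python
import pandas as pd

# Hypothetical toy survey; values are invented for illustration.
survey = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "F", "M", "F"],
    "country_live": ["US", "US", "US", "UK", "UK", "Canada", "US", "Germany"],
    "sought_treatment": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Fixed stand-in for the randomly chosen clusters.
random_clusters = ["US", "Canada"]

# Keep only the rows whose country is one of the chosen clusters.
cluster_sample = survey[survey["country_live"].isin(random_clusters)]

print(cluster_sample)
```

Every respondent from a chosen country is kept, and every respondent from an unchosen country is dropped, which is what distinguishes cluster sampling from sampling individuals.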

9. Plot cluster sample

If we want to see the distribution of tech workers who sought treatment for their mental health, we can call the value_counts method on the sought_treatment column and create a pie chart of the responses. It looks like this sample is split almost 50-50 between tech workers who sought professional treatment for their mental health and those who did not.
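As a rough illustration of the tally behind the pie chart, the sketch below uses a made-up cluster sample. The "Yes"/"No" coding of sought_treatment is an assumption, since the transcript does not show how the column is encoded.

```python
import pandas as pd

# Hypothetical cluster sample; the Yes/No coding is an assumption.
cluster_sample = pd.DataFrame({
    "sought_treatment": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# Raw counts per response.
treatment_counts = cluster_sample["sought_treatment"].value_counts()

# The same tally expressed as proportions of the sample.
treatment_shares = cluster_sample["sought_treatment"].value_counts(normalize=True)

print(treatment_counts)
print(treatment_shares)
```

With matplotlib installed, treatment_counts.plot.pie() would draw the pie chart from these counts.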

10. Plot cluster sample

If we were to randomly choose another set of clusters, we see that about 64% of tech workers sought treatment, while about 36% did not. The point of cluster sampling is to choose some clusters to represent the population, while the others remain unrepresented in the study. The difference between the two results is due to sampling error, which is why other sampling methods, like weighted sampling, are used to reduce that error.
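To see how much the result can move between draws, the sketch below repeats the cluster selection five times on a made-up survey and records the share of "Yes" responses each time. The data, the "Yes"/"No" coding, the cluster count of three, and the use of numpy's default_rng generator (in place of the random.choice call mentioned earlier) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey; values are invented for illustration.
survey = pd.DataFrame({
    "country_live": ["US", "US", "US", "US", "UK", "UK", "Canada", "Canada",
                     "Germany", "India", "India", "Brazil"],
    "sought_treatment": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No",
                         "No", "Yes", "Yes", "No"],
})

# Sorted so that a given seed always sees the countries in the same order.
countries = sorted(survey["country_live"].unique())

shares = []
for seed in range(5):
    rng = np.random.default_rng(seed)
    # Draw three distinct clusters for this repetition.
    clusters = rng.choice(countries, size=3, replace=False)
    sample = survey[survey["country_live"].isin(clusters)]
    # Fraction of respondents in the sample who sought treatment.
    share_yes = (sample["sought_treatment"] == "Yes").mean()
    shares.append(round(float(share_yes), 2))

print(shares)
```

Each entry in shares comes from a different set of clusters, so the spread of the values gives a feel for the sampling error discussed above.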

11. Let's practice!

Now it's your turn to practice some cluster sampling.