1. Weighted sampling
Welcome back!
2. What is weighted sampling?
Weighted sampling is a technique in which the probabilities of selecting a particular person to participate in our survey are no longer equal. Some subgroups in our population are assigned a higher probability of being selected, based on our determined survey needs.
This allows researchers to correct issues that occur during data collection.
3. Weighted sampling common variables
Weighted sampling is commonly performed on demographic characteristics like gender, age, location, and education.
This enables us to minimize the effects of certain behaviors, for example that some groups are more likely than others to participate in research studies.
4. Cell-based weighting
The simplest type of weighted sampling is cell-based weighting.
When we first compare a simple random sample to the population, we see that the sample data proportions do not match the population data proportions on the left. Therefore, the sample data must be proportioned according to the population.
5. Cell-based weighting
To calculate the weights applied to each age group in the sample, each group's weight is created by dividing the population proportion by the sample proportion.
6. Cell-based weighting
Then we multiply each sample proportion by its weight so that it matches the target population proportion. This scales each group in the sample up or down by exactly the right amount to represent the population.
7. Cell-based weighting
Comparing the population to the sample, we see that the proportion from each age group now matches.
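To make the arithmetic concrete, here is a minimal sketch of cell-based weighting. The age groups and proportions are made-up values for illustration only; they do not come from the slides.

```python
# Hypothetical proportions for three age groups (illustrative values only).
population = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
sample = {"18-34": 0.50, "35-54": 0.30, "55+": 0.20}

# Weight for each cell = population proportion / sample proportion.
weights = {group: population[group] / sample[group] for group in population}

# Multiplying each sample proportion by its weight recovers the population proportion.
weighted = {group: sample[group] * weights[group] for group in sample}

for group in population:
    print(f"{group}: weight={weights[group]:.2f}, weighted proportion={weighted[group]:.2f}")
```

Groups that are overrepresented in the sample get a weight below 1, and underrepresented groups get a weight above 1.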
8. Analyzing youth spending patterns with Python
We have an internet survey dataset that provides the spending habits of young people aged 15-30 in Slovakia. The variable, entertainment, includes responses to whether or not participants think they spend a lot of money on entertainment.
9. Analyzing youth spending patterns with Python
To visualize the proportions of our sample dataset using pandas, first, we create a crosstab of the variables, Gender and Entertainment, then plot a horizontal bar chart.
Evidently, most males think they spend a lot of money on entertainment, while it is almost a 50-50 split among the females.
But is this the true representation of young people in Slovakia? We'll use weighted sampling to find out.
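The crosstab step described above might look roughly like this. The tiny yp_survey DataFrame and the "Agree"/"Disagree" labels are stand-ins we assume for illustration; the real survey data is not reproduced here.

```python
import pandas as pd

# Made-up stand-in for the survey sample (assumed labels, not the real data).
yp_survey = pd.DataFrame({
    "Gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "Entertainment": ["Agree", "Disagree", "Agree", "Disagree",
                      "Agree", "Agree", "Agree", "Disagree"],
})

# Count respondents for every Gender x Entertainment combination.
counts = pd.crosstab(yp_survey["Gender"], yp_survey["Entertainment"])
print(counts)
```

With matplotlib installed, calling counts.plot.barh() on this table draws the horizontal bar chart.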
10. Analyzing youth spending patterns with Python
First, let's calculate the ratio of respondents for each category.
On the original sample, yp_survey, we start by calling the groupby function on the Gender and Entertainment columns. We then select the Age column and count the number of respondents in each category with the count function, and call the reset_index function to turn Gender and Entertainment back into regular columns. Finally, we rename the columns to Gender, Entertainment, and Respondents.
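As a rough sketch of that chain of calls, again using a small made-up yp_survey (the rows, ages, and "Agree"/"Disagree" labels are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for the survey sample (assumed values, not the real data).
yp_survey = pd.DataFrame({
    "Gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "Entertainment": ["Agree", "Disagree", "Agree", "Disagree",
                      "Agree", "Agree", "Agree", "Disagree"],
    "Age": [19, 22, 25, 17, 21, 28, 16, 30],
})

# Group by both variables, count non-missing Age values per group,
# then turn the group keys back into ordinary columns.
grouped = (
    yp_survey.groupby(["Gender", "Entertainment"])["Age"]
    .count()
    .reset_index()
)

# Rename the counted column so it describes what it holds.
grouped.columns = ["Gender", "Entertainment", "Respondents"]
print(grouped)
```

Note that count only tallies non-missing values, so this approach assumes every respondent has an Age recorded.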
11. Analyzing youth spending patterns with Python
To find each category's share of the sample, we divide the Respondents column by the sum of the Respondents column.
Suppose a more comprehensive survey than our internet survey revealed the actual population percentages: 35% are females who agree, 25% are females who disagree, 20% are males who agree, and 20% are males who disagree. To account for this with weights, we add a column that shows these actual population percentages.
To calculate weights, we divide the actual population group percentages by the sample group percentages.
We then calculate our weighted respondents by multiplying the Weight column by the Respondents column.
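Putting these steps together, a minimal sketch could look like the following. The respondent counts are hypothetical, and the column names Sample_Pct, Population_Pct, and Weighted_Respondents are our own choices for illustration; only the population percentages come from the narration.

```python
import pandas as pd

# Hypothetical respondent counts per category (not the real survey figures).
grouped = pd.DataFrame({
    "Gender": ["female", "female", "male", "male"],
    "Entertainment": ["Agree", "Disagree", "Agree", "Disagree"],
    "Respondents": [240, 260, 350, 150],
})

# Share of the sample that falls in each category.
grouped["Sample_Pct"] = grouped["Respondents"] / grouped["Respondents"].sum()

# Actual population shares quoted in the narration (35%, 25%, 20%, 20%).
grouped["Population_Pct"] = [0.35, 0.25, 0.20, 0.20]

# Weight = population share / sample share.
grouped["Weight"] = grouped["Population_Pct"] / grouped["Sample_Pct"]

# Weighted respondents = weight * original respondent count.
grouped["Weighted_Respondents"] = grouped["Weight"] * grouped["Respondents"]
print(grouped)
```

The weighted counts keep the same total as the original sample, but their split across categories now follows the population percentages.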
12. Analyzing youth spending patterns with Python
Finally, we plot the original respondent values against our weighted respondent values using the .plot.barh function.
Where we initially saw that the majority of males thought they spent a lot of money on entertainment, our weighted sample shows that males are in fact evenly split between saying they do and do not spend a lot of money on entertainment.
Also, where we initially thought that females had a similar response whether they did or did not spend a lot of money on entertainment, the weighted sample shows that most females think they spend a lot of money on entertainment.
Without weighted sampling, our female agreers and male disagreers are underrepresented and our female disagreers and male agreers are overrepresented.
13. Let's practice!
Practice time!