
Practice with Survey Weights: Part 1

Given the diversity of people in the United States, many surveys have difficulty gathering perfectly representative samples of the population. This can be problematic when your research results are meant to generalize to the population at large. However, surveyors typically know the true proportions of demographic traits in the U.S. population and provide survey weights that compensate for sampling bias by increasing the importance of underrepresented groups in an analyst's statistical analyses. In this problem, we will practice using survey weights.
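To make the idea of compensating for sampling bias concrete, here is a minimal sketch of one common way such weights can be built: divide each group's known population share by its share of the sample. The toy data, the 50/50 population split, and the variable names are all assumptions made for illustration; the exercise does not say how the weights in `Survey` were actually constructed.

```r
# Hypothetical toy sample: 7 men (Female = 0) and 3 women (Female = 1).
# This is NOT the course's `Survey` dataset.
toy <- data.frame(Female = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1))

# Assumed known population shares: half men, half women
pop_share <- c("0" = 0.5, "1" = 0.5)

# Share of each group in the toy sample
sample_share <- c("0" = mean(toy$Female == 0), "1" = mean(toy$Female == 1))

# Weight for each group = population share / sample share
group_weight <- pop_share / sample_share

# Attach the group weight to every respondent
toy$Weight <- unname(group_weight[as.character(toy$Female)])

round(group_weight, 2)  # undersampled group gets a weight above 1
mean(toy$Weight)        # weights built this way average 1
```

In this sketch the undersampled group (women) receives a weight above 1 and the oversampled group (men) a weight below 1, so each woman's answer counts for more in a weighted analysis.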

The American film studio Delimited Pictures is interested in how many tickets it should expect to sell to adults for its upcoming film, Cambrian Park, and which age group is most likely to watch it. The studio interviews people across the country, asking whether they plan to see the movie when it comes out. Although the resulting sample was not perfectly representative of the country's adult population, a statistician was able to provide survey weights that compensate for this sampling bias. Use these survey weights and the dataset Survey to help Delimited Pictures estimate what proportion of the U.S. adult population will see the upcoming movie, and to determine which age group is most likely to watch it.

This exercise is part of the course

Causal Inference with R - Regression


Exercise instructions

  • 1) Get a sense of survey weights by summarizing the dataframe Survey.
  • 2) Look through the following example of how survey weights can compensate for unrepresentative sampling.
  • 3) Estimate the proportion of the US adult population that plans to watch Cambrian Park (variable WillWatch) while using the provided survey weights.
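The difference between an unweighted and a weighted proportion can be sketched on a small made-up sample. The data frame `toy` and its weights are hypothetical and only borrow the column names used in this exercise; base R's `weighted.mean()` is one way to compute a weighted proportion, not necessarily the approach the course expects.

```r
# Hypothetical toy sample (not the course's `Survey` data):
# the first 7 respondents are men, the last 3 are women.
toy <- data.frame(
  Female    = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  WillWatch = c(1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  Weight    = c(rep(0.5 / 0.7, 7), rep(0.5 / 0.3, 3))  # assumed weights
)

# Unweighted proportion: every respondent counts equally
unweighted <- mean(toy$WillWatch)

# Weighted proportion: respondents with larger weights count for more
weighted <- weighted.mean(toy$WillWatch, w = toy$Weight)

# The same quantity written out by hand
by_hand <- sum(toy$Weight * toy$WillWatch) / sum(toy$Weight)

c(unweighted = unweighted, weighted = weighted, by_hand = by_hand)
```

Because the undersampled women in this toy sample are more likely to say yes, the weighted estimate (about 0.57) is higher than the unweighted one (0.4).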

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# 1) To get used to what survey weights look like, let's summarize the `Weight` variable in our dataset `Survey` with the summary() command:


    
# Note: The weights range between 0.7 and 1.6 in this dataset, but you may encounter survey datasets with much larger weights. Because the mean weight of 1 is closer to 0.7 than to 1.6, most respondents carry low weights and only a few carry high ones; those few high-weight individuals belong to groups that were undersampled. As you might guess, observations with higher weights have more influence in your statistical models.

    
# 2) As a teaser of what we are doing when we "weight" the data, let's look at the gender variable, `Female`. The first line below shows the unweighted proportion of females in this survey. Select the following code and hit the "Run Code" button to see the results.

    prop.table(table(Survey$Female))

# Note: These proportions are much too uneven to be representative of the actual US population. The following line shows the proportion of females in this survey once the responses are weighted. We have generated the code for you again, so select it and hit the "Run Code" button to see the results.

    prop.table(xtabs(Weight~Female, data=Survey))

# Note: These numbers appear much more realistic. There are about an equal number of men and women in the population.   
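To see why the weighted table comes out balanced, it may help to note that `xtabs()` with a numeric variable on the left of the formula sums that variable within each group, rather than counting rows. The sketch below uses a hypothetical data frame `toy` with assumed weights, not the course's `Survey` data.

```r
# Hypothetical toy sample: 7 men, 3 women, with assumed weights
toy <- data.frame(
  Female = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  Weight = c(rep(0.5 / 0.7, 7), rep(0.5 / 0.3, 3))
)

# Unweighted: each row counts once, so the split follows the sample
unweighted <- prop.table(table(toy$Female))

# Weighted: xtabs() sums Weight within each level of Female
weighted <- prop.table(xtabs(Weight ~ Female, data = toy))

# The same weighted shares computed by hand
manual <- tapply(toy$Weight, toy$Female, sum) / sum(toy$Weight)

unweighted
weighted
manual
```

Here the raw split is 70/30, while the weighted split matches the assumed 50/50 population shares.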


# 3) But before we weight the survey responses, let's see what the unweighted data give as the proportion of the U.S. adult population that is planning to watch Cambrian Park. This is simple: just find the mean value of the variable `WillWatch`:

    mean()
    