Visualizing the impact of survey weights

1. Impact of weights

Now that we have a handle on common survey design structures, let's look at a real-world survey to see how the survey design and the weights impact our analyses.

2. National Health and Nutrition Examination Survey (NHANES)

We will explore the National Health and Nutrition Examination Survey. The goal of NHANES is to assess the health of people in the US. Because the survey includes a medical exam in a mobile health vehicle, the researchers have put a lot of care into developing a cost-effective, representative sampling design. The data are collected in four stages. First, the US is stratified by geography and distribution of minority populations. Then counties are randomly selected within each stratum, where more populated counties are more likely to be sampled. From the sampled counties, city blocks are randomly selected, where again more populated blocks are more likely to be sampled. From the sampled city blocks, households are randomly selected based on demographic information. And lastly, within the sampled households, people are randomly selected for inclusion in the sample.
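To make the idea of "more populated counties are more likely to be sampled" concrete, here is a minimal sketch of the first sampling stage on a made-up frame. The strata, county names, populations, and the choice of two counties per stratum are all hypothetical and are not taken from NHANES; the sketch only shows selection with probability weighted by size inside each stratum.

```r
set.seed(1)

# Hypothetical toy frame: 2 strata with 5 counties each (made-up populations)
counties <- data.frame(
  stratum    = rep(c("A", "B"), each = 5),
  county     = paste0("county_", 1:10),
  population = c(500, 120, 80, 300, 50, 1000, 40, 60, 700, 200)
)

# Within each stratum, draw 2 counties without replacement,
# giving larger counties a higher chance of selection
sampled <- do.call(rbind, lapply(split(counties, counties$stratum), function(s) {
  s[sample(nrow(s), size = 2, prob = s$population), ]
}))

sampled
```

The later stages (city blocks, households, people) would repeat the same kind of draw inside each selected unit.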

3. NHANES

The 2009 to 2012 sample data, called NHANESraw, can be found in the NHANES package. Running the dim() command returns the number of rows, or observations, and the number of columns, or variables, contained in the dataset. We see that the NHANESraw dataset contains 78 variables on 20,293 people. Before specifying the design, we need to modify the survey weights variable, WTMEC2YR. WTMEC2YR provides the number of people in the US each sampled person represents. Therefore, summing up the weights via the summarize() command should provide a rough estimate of the total number of people in the US. However, we get an estimate of 608 million people, about twice as many as we should! That is because these weights were constructed assuming you have two years of data. Since we have four years, made up of two two-year survey cycles, we need to divide each weight by 2. To do that, we use the mutate() function to create a new column, WTMEC4YR, where each value is half the value in WTMEC2YR.
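The steps just described can be sketched as follows. This assumes the NHANES and dplyr packages are installed; the summary column name N_hat is only an illustrative choice.

```r
library(NHANES)  # provides the NHANESraw data
library(dplyr)

# Number of rows (people) and columns (variables)
dim(NHANESraw)

# Summing the two-year weights overshoots the US population by about a factor of 2
NHANESraw %>%
  summarize(N_hat = sum(WTMEC2YR))

# Four years of data are pooled, so halve each two-year weight
NHANESraw <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2)

# The rescaled weights should now sum to roughly the US population
NHANESraw %>%
  summarize(N_hat = sum(WTMEC4YR))
```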

4. NHANES

Let's specify the design with the R function svydesign(). In the arguments, we need to provide the dataset, NHANESraw, and the strata column, SDMVSTRA. Remember, id is where we specify the variables that represent the clusters. While the design actually had three levels of clustering (counties, city blocks, and households), it's common in practice to only specify the first level, denoted here by SDMVPSU. Running the distinct() function on SDMVPSU, we see it only takes on three values: 1, 2, and 3. This is because 1 to 3 counties were sampled within each stratum, and the same labels are reused from one stratum to the next. Therefore, we must include nest = TRUE because the cluster ids are nested within the strata. Lastly, the survey weights are given in WTMEC4YR.
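A minimal sketch of this design specification, assuming the survey, dplyr, and NHANES packages are installed. The object name NHANES_design is an illustrative choice, and the weight column is recreated here so the snippet runs on its own.

```r
library(NHANES)
library(dplyr)
library(survey)

# Recreate the four-year weight so this snippet is self-contained
NHANESraw <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2)

# The first-stage cluster labels repeat across strata
distinct(NHANESraw, SDMVPSU)

# Stratified, clustered design; `ids` is the full name of the argument
# that the transcript refers to as id
NHANES_design <- svydesign(
  data    = NHANESraw,
  strata  = ~SDMVSTRA,
  ids     = ~SDMVPSU,
  nest    = TRUE,        # cluster labels are only unique within a stratum
  weights = ~WTMEC4YR
)

summary(NHANES_design)
```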

5. Visualizing impact of weights

Now for some analyses! Suppose we want to estimate the distribution of race in the US. I created these two plots using the race variable in the NHANES dataset. In the top graph, I accounted for the survey weights and in the bottom graph, I didn't. Notice how different the distribution of race is between these two plots. The survey weights account for the sampling design, in which minority groups are over-sampled. They also adjust for non-response and are calibrated to known information about the population. In essence, if we ignore them, we will get a very wrong graph! The moral of the story is that survey weights cannot be ignored. And, in the rest of this course, we will learn how to ensure that the graphs, the models, and the analyses properly handle the weights!
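One way such a comparison could be produced is sketched below. It assumes the race variable is the Race1 column of NHANESraw and that ggplot2 and dplyr are installed; the object names nhanes, race_dist, p_weighted, and p_unweighted are illustrative, and this is not necessarily how the plots in the video were made.

```r
library(NHANES)
library(dplyr)
library(ggplot2)

# Add the four-year weight (half of the two-year weight)
nhanes <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2)

# Share of each race category with and without the weights
race_dist <- nhanes %>%
  group_by(Race1) %>%
  summarize(weighted = sum(WTMEC4YR), unweighted = n()) %>%
  mutate(
    weighted   = weighted / sum(weighted),
    unweighted = unweighted / sum(unweighted)
  )
race_dist

# Bar heights are sums of weights (estimated population counts)
p_weighted <- ggplot(nhanes, aes(x = Race1, weight = WTMEC4YR)) +
  geom_bar() +
  labs(title = "Weighted", y = "Estimated number of people")

# Bar heights are raw sample counts
p_unweighted <- ggplot(nhanes, aes(x = Race1)) +
  geom_bar() +
  labs(title = "Unweighted", y = "Number of sampled people")

p_weighted
p_unweighted
```

Comparing the weighted and unweighted columns of race_dist shows the same contrast as the two plots, but in numbers.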

6. Let's practice!

Time for some practice with the NHANES weights!
