
Specifying elements of the design in R

1. Elements of a sampling design

Now that we understand survey weights, let's learn some common design structures and how they are specified using the popular survey package.

2. Simple random sampling

Suppose we want to estimate the average amount of time Pennsylvania residents spend on social media. This means our target population is the residents of Pennsylvania and our study variable is the weekly hours spent on social media.

3. Simple random sampling

Suppose we have a list of every Pennsylvania resident and we randomly pick 200 people to survey. These 200 people are denoted by the pink dots. This method of sampling is called simple random sampling: everyone has an equal chance of being selected for our survey.

Once we have collected the data and imported it into R, we need to tell R about our sampling design. To do so, we will use the svydesign() function in Thomas Lumley's wonderful survey package. For simple random sampling, we need to specify the data frame, the column that stores the survey weights, the column that contains the population size, which goes in the fpc argument, short for finite population correction, and the sampling stages, which go in the id argument. Since simple random sampling has only one sampling stage and no clusters, we set id equal to ~1. The tildes in front of wts and N indicate that these are the names of columns in the dataset paSample. To recap, the svydesign() function tells R what sampling design generated the data and creates an object that contains the data and important design information.
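As a rough sketch of the call described above, the code below builds a toy stand-in for paSample and passes it to svydesign(). The population size, the simulated hours values, and the hours column name are assumptions made up for illustration; only paSample, wts, N, and id = ~1 come from the discussion.

```r
library(survey)

set.seed(42)

# Assumed values for illustration only; not real Pennsylvania figures.
N_pop <- 12800000   # hypothetical population size
n     <- 200        # sample size used in the example

# Toy stand-in for the paSample data frame.
paSample <- data.frame(
  hours = rgamma(n, shape = 2, rate = 0.2),  # simulated weekly hours (hypothetical column)
  wts   = N_pop / n,                         # each sampled person represents N/n residents
  N     = N_pop                              # population size, used for the fpc
)

# Simple random sample: one stage, no clusters, so id = ~1.
srs_design <- svydesign(data = paSample, weights = ~wts, fpc = ~N, id = ~1)

summary(srs_design)
svymean(~hours, srs_design)
```

Under simple random sampling every weight is the same, N divided by n, so the weights add up to the population size.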

4. Simple random sampling

Notice that if we randomly select 200 Pennsylvanians, many of them are concentrated around Pennsylvania's two largest cities, Philadelphia and Pittsburgh, since more people live there. Suppose we want to estimate social media usage by county.

5. Simple random sampling

We can't, since some counties, such as Warren and McKean, have no observations.
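To see why small counties can be missed, here is a small base R simulation. The county sizes are invented for illustration and cover only five counties; the point is only that a simple random sample of 200 people allocates very few draws, possibly none, to sparsely populated counties.

```r
set.seed(2024)

# Hypothetical county population sizes, for illustration only.
county_pop <- c(Philadelphia = 1600000, Allegheny = 1200000,
                Lancaster = 550000, Warren = 38000, McKean = 40000)

# Draw 200 distinct people from the whole population.
idx <- sample.int(sum(county_pop), size = 200)

# Map each sampled person index to a county using cumulative population sizes.
county <- cut(idx, breaks = c(0, cumsum(county_pop)), labels = names(county_pop))

# Counties with no sampled residents still appear here, with a count of 0.
counts <- table(county)
counts
```

Large counties dominate the counts, while the two small ones get only a handful of people at best, which is too few for a county-level estimate.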

6. Stratified sampling

To fix this issue, we can group our population into counties.

7. Stratified sampling

And then we take a simple random sample in each county. This is called stratified simple random sampling and the groups are called strata. This sampling design guarantees that every county is represented in the sample, which makes it useful for computing estimates for subgroups. To specify this sampling design, we need to pass the column in the dataset that contains the groupings to the strata argument in svydesign(). For our example, the groups are the counties. The entries of fpc should now reflect the population size of the associated county. Now, this sampling design can be expensive to implement. Imagine we plan to interview each sampled person at their home. In that case, this design will cost us a lot of gas money and travel time!
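A minimal sketch of the stratified specification might look like the following. The four counties, their population sizes, the per-county sample size, and the hours column are all assumptions for illustration; the strata, fpc, weights, and id arguments are used as described above.

```r
library(survey)

set.seed(1)

# Hypothetical county population sizes, for illustration only.
county_pop   <- c(Allegheny = 1200000, Philadelphia = 1600000,
                  Warren = 38000, McKean = 40000)
n_per_county <- 5   # assumed sample size within each county

paSample <- data.frame(
  county = rep(names(county_pop), each = n_per_county),
  hours  = rgamma(length(county_pop) * n_per_county, shape = 2, rate = 0.2)
)

# fpc now holds the population size of each person's county.
paSample$N   <- unname(county_pop[as.character(paSample$county)])
# Each person represents (county population) / (county sample size) residents.
paSample$wts <- paSample$N / n_per_county

strat_design <- svydesign(data = paSample, id = ~1, strata = ~county,
                          weights = ~wts, fpc = ~N)

# Because every county was sampled, county-level estimates are available.
by_county <- svyby(~hours, by = ~county, design = strat_design, FUN = svymean)
by_county
```

Note that the weights now differ across counties: a person sampled in a large county stands in for many more residents than a person sampled in a small one.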

8. Cluster sampling

To cut down on costs, another commonly used design is a cluster sampling design. The population is grouped in what are called clusters.

9. Cluster sampling

But now, instead of taking a sample within every cluster, a simple random sample of clusters, shown in black, is drawn.

10. Cluster sampling

Within each sampled cluster, a simple random sample of people is selected. For in-person surveys, it is very common to cluster by area as it cuts down on travel time! To specify a cluster sampling design, we need to change id and fpc to reflect the sampling stages. For our example, we first sampled counties and then we sampled people within each sampled county, so we set id to county plus person id. For fpc, we provide the column with the total number of counties, denoted here by N1, plus the column with the number of people in the person's county, denoted here by N2.
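The two-stage specification could be sketched as below. The sampled counties, their population sizes, the within-county sample size, and the column names personid and hours are assumptions for illustration; N1 and N2 follow the naming used above, and 67 is the number of counties in Pennsylvania.

```r
library(survey)

set.seed(7)

n_counties_total <- 67   # Pennsylvania has 67 counties
# Hypothetical population sizes of the sampled counties, for illustration only.
sampled_county_pop <- c(Erie = 270000, Centre = 160000,
                        Berks = 430000, York = 450000)
m <- 5                   # assumed number of people sampled per county

paSample <- data.frame(
  county   = rep(names(sampled_county_pop), each = m),
  personid = seq_len(length(sampled_county_pop) * m),  # hypothetical id column
  hours    = rgamma(length(sampled_county_pop) * m, shape = 2, rate = 0.2)
)

paSample$N1 <- n_counties_total   # stage 1: number of counties in the population
paSample$N2 <- unname(sampled_county_pop[as.character(paSample$county)])  # stage 2: people in the county

# Weight = (counties in population / counties sampled) * (people in county / people sampled).
paSample$wts <- (paSample$N1 / length(sampled_county_pop)) * (paSample$N2 / m)

# Stage 1 samples counties, stage 2 samples people within counties.
cluster_design <- svydesign(data = paSample, id = ~county + personid,
                            fpc = ~N1 + N2, weights = ~wts)

est <- svymean(~hours, cluster_design)
est
```

The order of terms matters: the first term of id lines up with the first term of fpc, and the second with the second.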

11. Let's practice!

Let's practice specifying sampling designs.