The General Social Survey

1. The General Social Survey

Welcome to Inference for Categorical Data, the fifth course in the statistics track. Categorical data arises in any situation where the values that you are recording are categories, not simply numbers.

2. GSS Data Explorer, only footer

One particularly rich trove of categorical data can be found in the General Social Survey.

3. GSS Data Explorer, with people chatting

Every year, researchers visit the homes of Americans and ask them a long list of questions about their history, behavior, and opinions on a number of topics of interest to social scientists.

4. GSS Data Explorer, with world

Generally, a few thousand people are surveyed each year, but researchers...

5. GSS Data Explorer, world and arrows

would like to make general statements about the opinions and social trends of the entire United States.

6. GSS Data Explorer, with people

This process of inference from the sample to the population is possible because the researchers are careful to select their respondents in such a way that their sample is representative of the population of all Americans. The result is a dataset where each sampled respondent is one row and each column is their response to a single question.

7. Exploring GSS

In R, a sample of this data is stored as a data frame called gss. If we glimpse the data frame, we learn that there are 25 variables: a unique identifier for each respondent, then a series of demographic variables like age and sex. Note that while these two are numerical data, represented as double and integer, the remaining variables are factors, R's term for categorical data. If we go further down, we get into the opinion questions. The happy variable records whether respondents on balance feel happy or unhappy. I'm curious to learn what the distribution of responses is to this question in the most recent year of the survey, 2016.
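As a sketch, assuming the course's gss data frame is loaded in your session (note this is not identical to the gss dataset that ships with the infer package), the inspection step looks like:

```r
# glimpse() comes from dplyr; it shows each column's name, type, and first values
library(dplyr)

# One row per sampled respondent, one column per survey question
glimpse(gss)
```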

8. Exploring GSS

My first step is to filter the dataset to only include those rows and save it as a new dataset called gss2016. Since this is categorical data, I'll visualize it with a bar chart. We learn that the most common response was "happy",
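A minimal sketch of these two steps, assuming gss has a year column and a happy factor:

```r
library(dplyr)
library(ggplot2)

# Keep only responses from the most recent survey year
gss2016 <- gss %>%
  filter(year == 2016)

# Bar chart of the happy variable; geom_bar() counts the rows in each category
ggplot(gss2016, aes(x = happy)) +
  geom_bar()
```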

9. Exploring GSS

but let's go a step further and calculate the exact proportion of the sample that responded this way.

10. Exploring GSS

To do that, we want to summarize the happy variable with a single proportion. In the middle of this line, we ask whether each respondent's happy value is exactly equal to "happy". This results in a column of TRUEs and FALSEs. You can find the proportion of TRUEs by simply taking the mean. I'll save that as p_hat. We learn that around 77 percent of our sample is "happy". This should be a good estimate of the percent of all Americans that are "happy", but it's not a sure thing since we only asked a small proportion of them.
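As a sketch of that summarizing step (assuming the gss2016 data frame from before):

```r
library(dplyr)

# happy == "happy" yields a logical column of TRUEs and FALSEs;
# the mean of a logical vector is the proportion of TRUEs
p_hat <- gss2016 %>%
  summarize(prop_happy = mean(happy == "happy")) %>%
  pull(prop_happy)

p_hat  # around 0.77 in this sample
```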

11. General 95% confidence interval

To capture the uncertainty in our estimate, we can create a confidence interval by adding and subtracting two standard errors from p-hat. We can estimate the standard error by using the bootstrap.

12. Bootstrap

We start with our full dataset and specify the variable that we'd like to focus on.

13. Bootstrap

This is done with the specify function.

14. Bootstrap

Next we draw a sample from that variable with replacement that is of the same size as our original dataset. This recreates the random variation that creeps in when you draw a sample from a population.

15. Bootstrap

We do this many times to create many bootstrap replicate datasets.

16. Bootstrap

This is done with generate.

17. Bootstrap

Next, for each replicate,

18. Bootstrap

we calculate the sample statistic,

19. Bootstrap

in this case: the proportion of respondents that said "happy".

20. Bootstrap

This is the role of calculate.

21. Bootstrap

At this point, I like to save this object: the collection of statistics from repeated resampling of our dataset.

22. Bootstrap

From here, we can look at their distribution using ggplot.

23. Bootstrap

This is called the "bootstrap" distribution.

24. Bootstrap

The standard deviation of this distribution is a good estimate of the standard error

25. Bootstrap

so our last step is to extract that using summarize.

26. Bootstrap Confidence Interval

To implement this, we start with our gss2016 data and then specify that we will focus on the happy column. Next we generate 500 replicate datasets through bootstrapping and for each one calculate the proportion that are "happy". When we print this new object, we see we now have a data frame that contains 500 p-hats.
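The pipeline just described can be sketched with the infer package, assuming gss2016 has a factor column named happy:

```r
library(infer)

boot <- gss2016 %>%
  # Focus on the happy variable; "success" is the level we want the proportion of
  specify(response = happy, success = "happy") %>%
  # Draw 500 resamples with replacement, each the same size as the original data
  generate(reps = 500, type = "bootstrap") %>%
  # For each replicate, compute the proportion of "happy" responses
  calculate(stat = "prop")

boot  # a data frame with one p-hat per replicate, in the stat column
```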

27. Bootstrap Confidence Interval

If we create a density plot of these statistics we see that it's unimodal and symmetric and ranges from roughly point-7 to point-8-5.
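A sketch of that plot, assuming the boot object of bootstrap statistics from the previous step:

```r
library(ggplot2)

# Density plot of the 500 bootstrap proportions stored in the stat column
ggplot(boot, aes(x = stat)) +
  geom_density()
```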

28. Bootstrap Confidence Interval

If we calculate the standard deviation of the stat variable, we see it's about point-0-3-4. With this standard error in hand, we can form our confidence interval by adding and subtracting twice that value from p-hat. We learn that we can be 95% confident that the proportion of all Americans that are "happy" is between point-7-0-5 and point-8-4-1.
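These final steps can be sketched as follows, assuming the boot and p_hat objects from earlier:

```r
library(dplyr)

# Estimate the standard error as the standard deviation
# of the bootstrap distribution
se <- boot %>%
  summarize(se = sd(stat)) %>%
  pull(se)

# 95% confidence interval: p_hat plus or minus two standard errors
c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)
```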

29. Let's practice!

OK, now it's your turn to practice with confidence intervals.