Exploring two categorical variables

1. Exploring two categorical variables

Now that we have a handle on how to properly account for the sampling design when we summarize and visualize a single categorical variable, let's look at exploring the relationship between two categorical variables.

2. NHANES: race and diabetes

As NHANES is a survey about the health of the US, let's explore a health variable, Diabetes, which has two categories: Yes, if a health professional has diagnosed the participant with diabetes, and No, if not. As before, we can use svytable() to create a frequency table of the estimated prevalence of diabetes. Based on the table, we'd estimate that around 24 million people in the US have been diagnosed with diabetes. Instead of exploring diabetes in isolation, let's ask how diabetes rates vary by race. For this, we need a contingency table where each entry contains the estimated count for one combination of the two variables. To create a survey-weighted contingency table of Race1 and Diabetes, we will again use svytable() and include both variables in the formula, separated by a plus sign. tab_w contains the estimated frequencies. For example, we'd estimate that 4 million Black Americans have been diagnosed with diabetes while over 32 million have not.
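As a minimal sketch of the two svytable() calls described here, the code below builds a tiny design from made-up data rather than the real NHANES design. The data frame `toy`, its weight column `wt`, and the object names `toy_design` and `tab_diabetes` are assumptions for illustration only. The weights-only design is also a simplification, since a real survey design would normally specify strata and clusters as well.

```r
library(survey)

# Hypothetical toy respondents with made-up weights (NOT real NHANES values)
toy <- data.frame(
  Race1    = factor(c("Black", "Black", "White", "White", "White", "Hispanic")),
  Diabetes = factor(c("Yes", "No", "No", "No", "Yes", "No")),
  wt       = c(1000, 3000, 5000, 4000, 2000, 2500)
)

# Simplified weights-only design (assumption: no strata or clusters)
toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)

# One variable: estimated number of people in each Diabetes category
tab_diabetes <- svytable(~Diabetes, design = toy_design)
tab_diabetes

# Two variables joined by a plus sign: survey-weighted contingency table
tab_w <- svytable(~Race1 + Diabetes, design = toy_design)
tab_w
```

Each cell of tab_w is the sum of the weights of the respondents in that combination, so it estimates a population count instead of a sample count.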

3. NHANES: race and diabetes

Before we graph the table, we need to transform it into a data frame where each row represents one of the entries of the contingency table. We now have three columns: Race1, Diabetes, and Freq, the estimated count. To graph the data, we need to specify three aesthetics: x equals Race1, fill equals Diabetes so that the bars are filled by diabetes status, and y, the height of each fill, equals Freq for geom_col.
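The conversion and plotting steps might look like the sketch below. It again uses a small made-up design in place of the NHANES one. The names `toy`, `wt`, `toy_design`, `tab_df`, and `p_stacked` are hypothetical, and coord_flip() is just one possible way to flip the coordinates.

```r
library(survey)
library(ggplot2)

# Hypothetical toy respondents with made-up weights (NOT real NHANES values)
toy <- data.frame(
  Race1    = factor(c("Black", "Black", "White", "White", "White", "Hispanic")),
  Diabetes = factor(c("Yes", "No", "No", "No", "Yes", "No")),
  wt       = c(1000, 3000, 5000, 4000, 2000, 2500)
)
toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)
tab_w <- svytable(~Race1 + Diabetes, design = toy_design)

# One row per cell of the contingency table: columns Race1, Diabetes, Freq
tab_df <- as.data.frame(tab_w)
tab_df

# Stacked bars: x is the racial group, fill is diabetes status, y is the estimated count
p_stacked <- ggplot(tab_df, aes(x = Race1, y = Freq, fill = Diabetes)) +
  geom_col() +
  coord_flip()  # horizontal bars keep the group labels readable
p_stacked
```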

4. NHANES: race and diabetes

This creates a stacked bar graph where we are stacking the diabetes bars within each racial group. Notice that for ease of reading, I flipped the coordinates. We can make a couple of observations. First, each bar has more pink than blue because, in every racial group, a person is more likely not to have been diagnosed with diabetes. Second, the height of each bar reflects the estimated count in that racial group. Therefore, white is the tallest bar since it is the most common group. To answer our original question, "Do diabetes rates vary by race?", we need to compare the relative proportion of blue to pink in each racial group.

5. NHANES: race and diabetes

We can do that by setting position equal to "fill" in geom_col. Now the height of each fill is the conditional proportion within a racial group. With this segmented bar plot, it is easier to see that some differences do exist in the diabetes rates between groups. In the next section, we will learn how to formally compare these rates via hypothesis tests.
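To make the segmented bar plot concrete, here is a hedged sketch that uses the same kind of made-up design as a stand-in for NHANES. The `Prop` column is an extra step not mentioned in the narration, added only to show the conditional proportions that position = "fill" draws. All object names (`toy`, `wt`, `toy_design`, `tab_df`, `p_fill`) are hypothetical.

```r
library(survey)
library(ggplot2)

# Hypothetical toy respondents with made-up weights (NOT real NHANES values)
toy <- data.frame(
  Race1    = factor(c("Black", "Black", "White", "White", "White", "Hispanic")),
  Diabetes = factor(c("Yes", "No", "No", "No", "Yes", "No")),
  wt       = c(1000, 3000, 5000, 4000, 2000, 2500)
)
toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)
tab_df <- as.data.frame(svytable(~Race1 + Diabetes, design = toy_design))

# Conditional proportion of each Diabetes category within its racial group
tab_df$Prop <- tab_df$Freq / ave(tab_df$Freq, tab_df$Race1, FUN = sum)
tab_df

# position = "fill" rescales every bar to total 1, so the fills show proportions
p_fill <- ggplot(tab_df, aes(x = Race1, y = Freq, fill = Diabetes)) +
  geom_col(position = "fill") +
  coord_flip()
p_fill
```

Because every bar now has the same length, differences in the share of each fill between groups reflect differences in estimated rates and not differences in group size.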

6. Let's practice!

But first, it’s time to practice creating survey-weighted contingency tables!


Analyzing Survey Data in R
