Visualizing a categorical variable

1. Visualizing a categorical variable

Now that we have specified the NHANES design, we can start to analyze the data. Let's begin by creating the visualizations of race we saw in the last chapter.

2. NHANES: visualizing race

We want to estimate the distribution of race in the US using the variable Race1, the self-reported race of the participants. Ignoring the survey weights, we can use dplyr to create a new data frame of raw counts and proportions. Let's step through this code. We take the dataset, NHANESraw, group by Race1, and create a new variable, Freq, which is the number of observations for each race. We then add a column called Prop, the proportion of observations in each race, and finally arrange the table from most common to least common. We find that 36% of the sample identifies as White, 22% as Black, and so on. Notice the race categories in Race1. These represent non-Hispanic White; non-Hispanic Black; Mexican American; Other race; and Other, non-Mexican, Hispanic. Recently, the variable has been updated to include non-Hispanic Asian. More information on the variables in the NHANES dataset can be found in the documentation on the CDC website.
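The steps just described can be sketched roughly as follows. This is a minimal sketch, assuming NHANESraw is loaded from the NHANES package; the table name tab_unw is an illustrative choice, not something stated in the narration.

```r
library(dplyr)
library(NHANES)  # assumption: NHANESraw comes from the NHANES package

# Unweighted counts and proportions of Race1 (survey weights ignored)
tab_unw <- NHANESraw %>%
  group_by(Race1) %>%
  summarize(Freq = n()) %>%          # number of observations per race
  mutate(Prop = Freq / sum(Freq)) %>%  # share of the sample per race
  arrange(desc(Prop))                # most common category first

tab_unw
```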

3. NHANES: visualizing race

Now we can use ggplot2 to graph the distribution of race. In the ggplot() base layer, we provide the dataset name and the aesthetics we want to map. On the x-axis, we map the race categories; on the y-axis, we map Prop, which gives the height of each bar. The geometric object is actually a bar, but we specify geom_col(), as in column, to signal that the bar heights are given directly by the values mapped to y.
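A minimal sketch of that base layer plus geom_col() might look like this. The table tab_unw is rebuilt here so the snippet runs on its own; its name, and loading NHANESraw from the NHANES package, are assumptions.

```r
library(dplyr)
library(ggplot2)
library(NHANES)  # assumption: source of NHANESraw

# Unweighted proportions of Race1 (illustrative name: tab_unw)
tab_unw <- NHANESraw %>%
  count(Race1, name = "Freq") %>%
  mutate(Prop = Freq / sum(Freq))

# x = race category, y = precomputed bar height
p_col <- ggplot(data = tab_unw, mapping = aes(x = Race1, y = Prop)) +
  geom_col()

p_col
```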

4. NHANES: visualizing race

If that seems confusing, notice that geom_col() produces the same graphic as geom_bar() with stat = "identity".
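One way to check this equivalence is to build both versions and compare the bar heights ggplot2 computes for each. This is a sketch under the same assumptions as before (NHANESraw from the NHANES package, an illustrative table named tab_unw).

```r
library(dplyr)
library(ggplot2)
library(NHANES)  # assumption: source of NHANESraw

tab_unw <- NHANESraw %>%
  count(Race1, name = "Freq") %>%
  mutate(Prop = Freq / sum(Freq))

base <- ggplot(tab_unw, aes(x = Race1, y = Prop))

p_col <- base + geom_col()                  # heights taken from y
p_bar <- base + geom_bar(stat = "identity")  # same, spelled out

# Both layers should end up with identical bar heights
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```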

5. NHANES: visualizing race

But the code is more succinct! The next layer flips the x and y axes, and lastly, by specifying the limits in scale_x_discrete(), we can order the bars. Do you think this sample distribution of race accurately reflects the distribution of race in the US? Probably not. Remember, minority groups are over-sampled to ensure adequate sample sizes within each group. Therefore, they make up a higher proportion of the sample than they do in the population. So, if we want to use the sample data to estimate the distribution of race, we had better take our sampling design into account!
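The two extra layers could be sketched as below. The narration does not name the function that flips the axes; coord_flip() is assumed here, and the names tab_unw and race_order are illustrative.

```r
library(dplyr)
library(ggplot2)
library(NHANES)  # assumption: source of NHANESraw

tab_unw <- NHANESraw %>%
  count(Race1, name = "Freq") %>%
  mutate(Prop = Freq / sum(Freq)) %>%
  arrange(desc(Prop))

# Bar order is set by the limits; passed as a character vector
race_order <- as.character(tab_unw$Race1)

p_ordered <- ggplot(tab_unw, aes(x = Race1, y = Prop)) +
  geom_col() +
  coord_flip() +                          # assumed axis-flipping layer
  scale_x_discrete(limits = race_order)   # controls the bar order

p_ordered
```

After the flip, the first entry in the limits is drawn closest to the origin, so reversing race_order with rev() reverses the vertical order of the bars.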

6. NHANES: visualizing race

To account for the sampling design, we will use the svytable() function to compute the frequencies for Race1, also feeding in the sample design we constructed in the last chapter. After converting the output to a data frame, the last two operations are the same as before: computing the proportions and then arranging from most to least common. While the order of the categories has not changed, the estimated proportions certainly have! For example, the estimated proportion of White Americans has nearly doubled.
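A sketch of the weighted table follows. The design from the last chapter is not shown in this section, so the svydesign() call below is an assumption: it uses the strata, cluster, and weight columns shipped with NHANESraw in the NHANES package and halves the two-year weight to cover the four-year span. The names NHANES_design and tab_w are illustrative.

```r
library(dplyr)
library(survey)
library(NHANES)  # assumption: source of NHANESraw

# Assumed design: four-year weight taken as half the two-year weight
NHANESraw <- NHANESraw %>% mutate(WTMEC4YR = 0.5 * WTMEC2YR)
NHANES_design <- svydesign(
  data = NHANESraw,
  strata = ~SDMVSTRA,
  id = ~SDMVPSU,
  nest = TRUE,
  weights = ~WTMEC4YR
)

# Survey-weighted frequencies and proportions of Race1
tab_w <- svytable(~Race1, design = NHANES_design) %>%
  as.data.frame() %>%
  mutate(Prop = Freq / sum(Freq)) %>%
  arrange(desc(Prop))

tab_w
```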

7. NHANES: visualizing race

To create a bar plot, we will use the same code as we did for the unweighted table. The only change is switching the data to tab_w, the survey-weighted table. Now, we get a graph that more accurately reflects the distribution of race in America from 2009 to 2012.
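Because only the data argument changes, the plotting code can be wrapped in a small helper and reused for either table. This is a sketch, not the course's own code: plot_race_props() is a hypothetical helper, coord_flip() is an assumed choice for flipping the axes, and the design is the same assumed one built from the NHANES package columns.

```r
library(dplyr)
library(ggplot2)
library(survey)
library(NHANES)  # assumption: source of NHANESraw

# Assumed design (four-year weight = half the two-year weight)
NHANESraw <- NHANESraw %>% mutate(WTMEC4YR = 0.5 * WTMEC2YR)
NHANES_design <- svydesign(data = NHANESraw, strata = ~SDMVSTRA,
                           id = ~SDMVPSU, nest = TRUE,
                           weights = ~WTMEC4YR)

tab_w <- svytable(~Race1, design = NHANES_design) %>%
  as.data.frame() %>%
  mutate(Prop = Freq / sum(Freq)) %>%
  arrange(desc(Prop))

# Hypothetical helper: same layers, any table with Race1 and Prop
plot_race_props <- function(tab) {
  ggplot(tab, aes(x = Race1, y = Prop)) +
    geom_col() +
    coord_flip() +
    scale_x_discrete(limits = as.character(tab$Race1))
}

p_w <- plot_race_props(tab_w)
p_w
```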

8. Let's practice!

Now it's your turn.

Analyzing Survey Data in R