Visualizing quantitative data

1. Visualizing a quantitative variable

Now it's time to explore graphing quantitative survey variables.

2. Table of means

Let's return to the physical health example where we explored the average number of bad days for smokers and non-smokers. We estimated the means using svyby(). Let's use the output, stored as out, to create a bar graph of the means.
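As a reminder of what that estimation step can look like, here is a minimal sketch. It assumes the survey package and uses a small made-up data frame (toy) with a hypothetical weight column (wt) and a weights-only design, rather than the real survey design object, so the numbers it prints mean nothing.

```r
library(survey)

# Made-up toy data standing in for the real survey; every value is illustrative.
set.seed(1)
toy <- data.frame(
  SmokeNow        = factor(rep(c("No", "Yes"), each = 50)),
  DaysPhysHlthBad = c(rpois(50, 3), rpois(50, 5)),
  wt              = runif(100, 1000, 5000)  # hypothetical sampling weights
)

# Simplified design (an assumption): weights only, no strata or clusters.
toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)

# Weighted mean of bad-health days within each smoking group.
out <- svyby(formula = ~DaysPhysHlthBad, by = ~SmokeNow,
             design = toy_design, FUN = svymean, na.rm = TRUE)
out
```

The result has one row per group, with the estimated mean in DaysPhysHlthBad and its standard error in se.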

3. Bar graphs

To create the bar graph, we specify the dataset, out, and the aesthetics in the ggplot() base layer. We place SmokeNow, the grouping variable, on the x-axis, and the height of each bar, y, is given by DaysPhysHlthBad. labs() adds the axis labels, which we neglected in earlier chapters but are now ready to incorporate!
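A sketch of that plot might look like the following. The out data frame here is a hand-typed stand-in with made-up numbers, the label text is my own, and geom_col() is an assumption, since the geom is not named above.

```r
library(ggplot2)

# Hypothetical stand-in for the svyby() output; the numbers are made up.
out <- data.frame(
  SmokeNow        = c("No", "Yes"),
  DaysPhysHlthBad = c(3.5, 4.9),
  se              = c(0.2, 0.3)
)

# Grouping variable on the x-axis, estimated mean as the bar height.
p <- ggplot(data = out,
            mapping = aes(x = SmokeNow, y = DaysPhysHlthBad)) +
  geom_col() +  # assumed geom: draws bars whose heights are the y values
  labs(x = "Current smoker",
       y = "Mean days of poor physical health per month")
p
```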

4. Bar graphs with error bars

Sometimes it's nice to add error bars to give the viewer a sense of uncertainty about the bar's true height. Here the top of the error bar is one standard error above the group mean and the bottom is one standard error below. To create this plot we need to add these error calculations to our data frame.

5. Bar graphs with error bars

Using mutate(), I added the columns lower and upper, which are computed by subtracting the standard error from the estimated mean and adding the standard error to it, respectively.
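That calculation can be sketched as below, assuming the standard error sits in a column named se and using a made-up stand-in for out. The name out_col for the result is hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the svyby() output; the numbers are made up.
out <- data.frame(
  SmokeNow        = c("No", "Yes"),
  DaysPhysHlthBad = c(3.5, 4.9),
  se              = c(0.2, 0.3)
)

# One standard error below and above each estimated mean.
out_col <- mutate(out,
                  lower = DaysPhysHlthBad - se,
                  upper = DaysPhysHlthBad + se)
out_col
```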

6. Bar graphs with error bars

Now we modify our ggplot() code by mapping lower to ymin and upper to ymax in the aesthetics and then adding a new geom, geom_errorbar, whose width argument controls how wide the error bars are drawn.
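Put together, the modified plot could look something like this. The data are a made-up stand-in, geom_col() and the width value of 0.7 are assumptions, and out_col is a hypothetical name.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the svyby() output, with the error bounds added.
out_col <- data.frame(
  SmokeNow        = c("No", "Yes"),
  DaysPhysHlthBad = c(3.5, 4.9),
  se              = c(0.2, 0.3)
) %>%
  mutate(lower = DaysPhysHlthBad - se,
         upper = DaysPhysHlthBad + se)

# ymin and ymax are picked up by geom_errorbar(); geom_col() ignores them.
p <- ggplot(data = out_col,
            mapping = aes(x = SmokeNow, y = DaysPhysHlthBad,
                          ymin = lower, ymax = upper)) +
  geom_col() +                 # assumed geom for the bars
  geom_errorbar(width = 0.7) + # 0.7 is an arbitrary illustrative width
  labs(x = "Current smoker",
       y = "Mean days of poor physical health per month")
p
```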

7. Histogram

We may want to display more than just the central tendency of a survey variable. For example, maybe we want to see the shape of the distribution of the number of poor health days. For this we can create a histogram or a density plot. For the histogram, we use geom_histogram. The aesthetics are our variable of interest, DaysPhysHlthBad, and weight, our sampling weights. Weighting each participant by their survey weight is an important step towards ensuring we get a shape that accurately estimates the distribution of poor health days for the population, Americans over 12 years old. Without the weights, the plot only reflects the distribution of the sample and doesn't account for the complex sampling design! Within geom_histogram, I also specified a binwidth of 1 so that each bar represents a single day and colored the bars white so they are easier to differentiate. As we saw from the median, most people have zero bad health days in a month. But the distribution is strongly right-skewed, with small peaks at 10, 15, 20, and 30 days. The bar at 30 represents the portion of the population who feel their health is poor for the whole month.
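Here is a minimal sketch of a weighted histogram on made-up data. The data frame toy and the weight column wt are hypothetical, and I am assuming the white coloring was done with color = "white", which outlines each bar in white.

```r
library(ggplot2)

# Made-up toy data; wt is a hypothetical sampling-weight column.
set.seed(1)
toy <- data.frame(
  DaysPhysHlthBad = sample(c(0, 0, 0, 1, 2, 5, 10, 15, 30), 200, replace = TRUE),
  wt              = runif(200, 1000, 5000)
)

# weight = wt makes each bar the sum of the weights in that bin,
# not the number of sampled rows.
p <- ggplot(data = toy,
            mapping = aes(x = DaysPhysHlthBad, weight = wt)) +
  geom_histogram(binwidth = 1, color = "white") +  # one bar per day
  labs(x = "Days of poor physical health per month",
       y = "Estimated number of people")
p
```

Dropping weight = wt from aes() would give the unweighted sample histogram instead, which is a quick way to see how much the weights change the shape.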

8. Density plot

A density plot will also give us a sense of shape. Because the area under the curve represents probability, it must integrate to one, so we have a few data wrangling steps to perform before creating the plot. First, we'll remove NAs for our variable via filter(), and then we'll add a column of weights rescaled to sum to 1. Notice we can then pipe this data directly into ggplot()! We replace geom_histogram with geom_density and we have our plot! bw is an important argument of geom_density: it sets the smoothing bandwidth, so larger values give a smoother curve.
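Those wrangling steps and the plot can be sketched as follows, again on made-up data. The names toy, wt, wt_std and dens_data are hypothetical, and the bw value of 0.6 is only an illustrative choice.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data with a few missing values; wt is a hypothetical weight column.
set.seed(1)
toy <- data.frame(
  DaysPhysHlthBad = c(sample(0:30, 195, replace = TRUE), rep(NA, 5)),
  wt              = runif(200, 1000, 5000)
)

# Drop missing values first, then rescale the weights so they sum to 1.
dens_data <- toy %>%
  filter(!is.na(DaysPhysHlthBad)) %>%
  mutate(wt_std = wt / sum(wt))

# The wrangled data can be piped straight into ggplot().
p <- dens_data %>%
  ggplot(mapping = aes(x = DaysPhysHlthBad, weight = wt_std)) +
  geom_density(bw = 0.6) +  # illustrative bandwidth; larger is smoother
  labs(x = "Days of poor physical health per month")
p
```

The order matters: filtering before rescaling means the weights of the rows that remain sum to 1, not the weights of the full sample.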

9. Faceted density plots

We can incorporate a categorical variable into our density plot through faceting. In this case, the density curve is constructed for the quantitative variable within each group of the categorical variable. Again, we have to do some data wrangling before constructing the graph, filtering out missing values and then standardizing the weights within each group.

10. Faceted density plots

Notice, I added one more layer to the plot, facet_wrap, to facet on SmokeNow.
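A sketch of the faceted version, on made-up data, might look like this. The names toy, wt, wt_std and dens_data are hypothetical, the bw value is illustrative, and label_both is just my choice of facet labeller.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data with some missing values in both variables.
set.seed(1)
toy <- data.frame(
  SmokeNow        = sample(c("No", "Yes", NA), 300, replace = TRUE),
  DaysPhysHlthBad = sample(c(0:30, NA), 300, replace = TRUE),
  wt              = runif(300, 1000, 5000)  # hypothetical sampling weights
)

# Remove missing values, then standardize the weights within each group
# so that each facet's weights sum to 1.
dens_data <- toy %>%
  filter(!is.na(SmokeNow), !is.na(DaysPhysHlthBad)) %>%
  group_by(SmokeNow) %>%
  mutate(wt_std = wt / sum(wt)) %>%
  ungroup()

p <- ggplot(data = dens_data,
            mapping = aes(x = DaysPhysHlthBad, weight = wt_std)) +
  geom_density(bw = 0.6) +  # illustrative bandwidth
  facet_wrap(~ SmokeNow, labeller = "label_both") +
  labs(x = "Days of poor physical health per month")
p
```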

11. Let's practice!

Now it's your turn.
