Visualization with scatter plots

1. Visualization with scatterplots

In this chapter, we will explore visualizing and modeling the relationships between quantitative variables. So far in the course, we have primarily focused on estimating finite population parameters, such as the prevalence of diabetes for specific racial groups or average hours of sleep. Now we will focus on describing how two variables relate. When the goal is modeling, the population of interest tends to extend beyond the specific group that was sampled.

2. Head size and age

Continuing with the NHANES data, let's look at the relationship between age in months and head circumference in centimeters. Since head circumference was only measured for babies 6 months and younger, let's first filter the data to only include those rows. There are 484 babies 6 months and younger in the NHANESraw dataset.
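As a rough sketch of this filtering step, assuming the NHANES R package (which provides NHANESraw) and dplyr are installed. The object name babies is chosen here for illustration and is not taken from the slides:

```r
library(NHANES)  # provides the NHANESraw data frame
library(dplyr)

# Keep only babies aged 6 months or younger, and only the two
# variables we want to plot. Rows with a missing AgeMonths are
# dropped by filter() automatically.
babies <- NHANESraw %>%
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc)

# Number of babies retained
nrow(babies)
```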

3. Scatterplots

The usual graphic to visualize the relationship between two quantitative variables is the scatter plot. For scatter plots, the geometric object is a point where one quantitative variable is mapped to x and the other to y. Here we mapped AgeMonths to x and HeadCirc to y. Scatter plots are great for displaying trends in the data. For example, we see that as age increases, head size also increases, and in a fairly linear fashion. One issue is that since age is in months, there are only 7 distinct values: 0, 1, 2, and so on up to 6. Therefore, many of the points are plotted on top of each other and we can't discern where points are more or less dense.
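The scatter plot described here could be drawn roughly as follows. This is a minimal sketch assuming ggplot2, dplyr, and the NHANES package; the names babies and p are illustrative:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

# Babies aged 6 months or younger (illustrative object name)
babies <- NHANESraw %>%
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc)

# Count the distinct ages, which is what causes points to stack
n_distinct(babies$AgeMonths)

# Map age to x and head circumference to y, drawn as points.
# Babies with a missing HeadCirc are dropped with a warning.
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc)) +
  geom_point()
p
```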

4. Jittering

We can remedy this issue by adding a little random noise to the points, which is called jittering. To do this, we swap out geom_point for geom_jitter and then specify how much we want to jitter the points. Since stacking is only an issue in the x direction for this data, I have only specified a width and no height. So the points will only move a bit left or right, not up or down. Now we can see the data more clearly and still observe the upward trend. However, the main issue with our scatter plot is that it doesn't incorporate the sampling design. Each dot here is a baby in the NHANES sample, but remember that each baby represents a different number of babies in the population.
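The jittering step might look something like the sketch below. The width of 0.3 is an assumed value, since the transcript does not state one. I also set height = 0 explicitly, because geom_jitter applies a small default jitter in both directions when an argument is left out:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

babies <- NHANESraw %>%
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc)

# width = 0.3 is an assumed amount of horizontal noise;
# height = 0 keeps every point at its measured head circumference.
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc)) +
  geom_jitter(width = 0.3, height = 0)
p
```

Keeping the width well below 0.5 means the noisy points for neighbouring months do not overlap, so each cloud can still be read as a single age.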

5. Survey-weighted scatterplots

Let's add our survey weight variable, WTMEC4YR, to our dataset. This tells us the first baby represents 12,915 babies while the third only represents 2,359. We want to account for these sampling weight differences in our scatter plot.
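A sketch of carrying the weight along with the plotting variables. NHANESraw ships with a two-year weight, WTMEC2YR. The code below assumes WTMEC4YR is that weight halved (because the data combine two survey cycles) and recreates it in case it was not defined earlier:

```r
library(NHANES)
library(dplyr)

babies <- NHANESraw %>%
  # Assumption: four-year weight = two-year weight / 2
  mutate(WTMEC4YR = WTMEC2YR / 2) %>%
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc, WTMEC4YR)

# Inspect the first few babies and the number each one represents
head(babies, 3)
```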

6. Bubble plots

One way to do that is by mapping the weights to the size aesthetic. Then the size of the bubbles is proportional to the weight, where a larger bubble represents a larger survey weight. The last layer specifies that I don't want a legend. This is called a bubble plot. For this example, we are back to the overplotting issue, with the bubbles lying on top of one another.
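A bubble plot along these lines could be sketched as below. The jitter width and the definition of WTMEC4YR as half of WTMEC2YR are assumptions. guides(size = "none") is one way to suppress the legend in current ggplot2:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

babies <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2) %>%  # assumed definition of the weight
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc, WTMEC4YR)

# Map the survey weight to point size: bigger bubble, bigger weight
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc, size = WTMEC4YR)) +
  geom_jitter(width = 0.3, height = 0) +
  guides(size = "none")  # drop the size legend
p
```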

7. Bubble plots

Therefore, I have set alpha equal to 0.3 so that all the points are fairly transparent and we can see the stacking. Notice that most of the stacking is occurring along the linear trend we observed earlier.
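Setting a constant alpha might look like the following sketch. Here alpha is passed as a fixed argument to the geom rather than inside aes(), so every bubble gets the same transparency. As before, the jitter width and the weight definition are assumptions:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

babies <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2) %>%  # assumed definition of the weight
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc, WTMEC4YR)

# alpha = 0.3 makes each bubble 30% opaque, so regions where many
# bubbles overlap appear darker.
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc, size = WTMEC4YR)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.3) +
  guides(size = "none")
p
```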

8. Survey-weighted scatterplots

We can also use saturation of color to signify our survey weight. To do this, we now map the weights to color instead of size where the darker the color, the larger the survey weight.
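One possible sketch of the color version. ggplot2's default continuous color gradient runs from dark (low values) to light (high values), so to match the description here I reverse it with scale_color_gradient. The two color names are my own choice, not something stated in the transcript:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

babies <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2) %>%  # assumed definition of the weight
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc, WTMEC4YR)

# Map the weight to color instead of size. The gradient endpoints are
# assumptions, picked so that darker points carry larger weights.
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc, color = WTMEC4YR)) +
  geom_jitter(width = 0.3, height = 0) +
  scale_color_gradient(low = "lightblue", high = "darkblue")
p
```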

9. Survey-weighted scatterplots

Another option is opacity where the more transparent a point, the smaller the survey weight.
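The opacity version could be sketched by moving the weight into the alpha aesthetic. With ggplot2's default alpha scale, smaller values are drawn more transparently, which matches the description above. The jitter width and weight definition remain assumptions:

```r
library(NHANES)
library(dplyr)
library(ggplot2)

babies <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR / 2) %>%  # assumed definition of the weight
  filter(AgeMonths <= 6) %>%
  select(AgeMonths, HeadCirc, WTMEC4YR)

# Map the weight to alpha inside aes(): babies with small survey
# weights fade out, babies with large weights stay solid.
p <- ggplot(babies, aes(x = AgeMonths, y = HeadCirc, alpha = WTMEC4YR)) +
  geom_jitter(width = 0.3, height = 0)
p
```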

10. Let's practice!

Now let's practice creating scatter plots that account for the sampling design!