Summarizing quantitative data

1. Summarizing quantitative data

Since we are feeling comfortable analyzing categorical survey data, let's shift our focus to a quantitative variable.

2. Summary statistics

Let's look at the variable DaysPhysHlthBad from NHANES, which is the number of days in the last 30 days that a participant felt their physical health was bad. Only participants who were at least 12 years old were asked the question, so I have filtered the data to include only those rows and the variable of interest. The amount of information we can glean from a snapshot of the raw data is limited, so we should ask ourselves, "What statistics can we compute with this variable to better understand the physical health of Americans?"
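As a rough illustration of that filtering step, here is a minimal base R sketch. The data frame `toy` and its values are hypothetical stand-ins, not NHANES records; only the column names Age and DaysPhysHlthBad come from the narration.

```r
# Hypothetical toy rows (NOT NHANES data); NA marks a missing response.
toy <- data.frame(
  Age             = c(8, 12, 15, 34, 51, 67, 10, 45),
  DaysPhysHlthBad = c(NA, 0, 2, 0, NA, 30, NA, 1)
)

# Keep participants aged 12 or older and only the variable of interest.
asked <- subset(toy, Age >= 12, select = DaysPhysHlthBad)

# A snapshot of the raw values; on its own it tells us very little.
head(asked)
```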

3. Mean, total, and median

Maybe we want to estimate the average number of poor health days. We can compute this value using svymean(). We must specify the variable of interest, our design, and na.rm = TRUE to remove the missing values. The output tells us that we would estimate that, in a given month, Americans have, on average, 3.3 days of poor physical health. The standard error, a measure of uncertainty, is also reported. There are other quantities we may want to estimate, such as the total number of poor health days in a given month. To obtain the total, we can swap out svymean() for svytotal(). For this example, though, we are probably more interested in measures of center, so we can also compute the median. To do this, we will use svyquantile() and add the argument quantiles = 0.5, since the median is the 50th percentile. In this case, the median equals 0, which means we'd estimate that at least half of Americans have zero bad health days in a given month.
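The three calls described above might look like the following sketch. It assumes the survey package is installed and uses a small made-up stratified cluster sample in place of NHANES; `toy`, `toy_design`, and every value in them are hypothetical, so the estimates will not match the 3.3 days quoted in the narration.

```r
library(survey)

# Hypothetical toy sample (NOT NHANES): 2 strata, 2 PSUs per stratum,
# 4 people per PSU, with made-up survey weights.
toy <- data.frame(
  stratum = rep(c(1, 2), each = 8),
  psu     = rep(c(1, 2, 1, 2), each = 4),
  wt      = c(10, 12, 8, 10, 15, 15, 20, 10, 5, 5, 10, 10, 25, 20, 15, 10),
  DaysPhysHlthBad = c(0, 0, 2, NA, 0, 5, 0, 0, 30, 0, 1, 0, 0, 3, NA, 0)
)

# Design object playing the role of the course's NHANES design.
toy_design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                        data = toy, nest = TRUE)

# Survey-weighted mean, total, and median; na.rm = TRUE drops missing values.
est_mean   <- svymean(~DaysPhysHlthBad, design = toy_design, na.rm = TRUE)
est_total  <- svytotal(~DaysPhysHlthBad, design = toy_design, na.rm = TRUE)
est_median <- svyquantile(~DaysPhysHlthBad, design = toy_design,
                          quantiles = 0.5, na.rm = TRUE)

print(est_mean)
print(est_total)
print(est_median)
```

Each printed result pairs the point estimate with a measure of uncertainty, which comes from the design information (strata and PSUs) rather than from treating the rows as a simple random sample.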

4. Summarizing by group

In survey statistics, it's also common to estimate quantities within different sub-groups of the population. We saw this in the last chapter, where we looked at diabetes prevalence within racial groups. For a quantitative variable like DaysPhysHlthBad, our sub-groups could be smokers and non-smokers, and we may want to estimate the average number of poor health days for each group. To do this, we will use the function svyby(). For formula, we specify the variable of interest, DaysPhysHlthBad, and then in the argument by, we provide the grouping variable, SmokeNow. FUN is the function we want to apply within each group, which in this case is the survey-weighted mean, svymean. Again, we must specify that the missing values should be removed, and row.names = FALSE removes row names. We see that smokers have, on average, one more bad physical health day than non-smokers in a given month. Remember, survey data are observational, not experimental, and therefore we can't conclude causation. There could be confounding variables at work here. For example, smoking could be a proxy for age. In other words, maybe the smoking group is older than the non-smoking group and it is actually age, not smoking, that is impacting healthiness.
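A grouped estimate of this kind could be sketched as follows, again on made-up data rather than NHANES (`toy`, `toy_design`, and all values are hypothetical). One assumption to flag: this sketch drops row names with keep.names = FALSE, which is the argument listed in the survey package's svyby() documentation for that purpose.

```r
library(survey)

# Hypothetical toy sample (NOT NHANES): 2 strata, 2 PSUs per stratum,
# 4 people per PSU, alternating non-smokers and smokers.
toy <- data.frame(
  stratum  = rep(c(1, 2), each = 8),
  psu      = rep(c(1, 2, 1, 2), each = 4),
  wt       = c(10, 12, 8, 10, 15, 15, 20, 10, 5, 5, 10, 10, 25, 20, 15, 10),
  SmokeNow = factor(rep(c("No", "Yes"), times = 8), levels = c("No", "Yes")),
  DaysPhysHlthBad = c(0, 0, 2, NA, 0, 5, 0, 0, 30, 0, 1, 0, 0, 3, NA, 0)
)

toy_design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                        data = toy, nest = TRUE)

# Survey-weighted mean of DaysPhysHlthBad within each SmokeNow group.
# na.rm = TRUE is passed through to svymean(); keep.names = FALSE
# keeps the output free of group-based row names.
by_smoke <- svyby(formula = ~DaysPhysHlthBad, by = ~SmokeNow,
                  design = toy_design, FUN = svymean,
                  na.rm = TRUE, keep.names = FALSE)

print(by_smoke)
```

The result has one row per group, with the estimated mean and its standard error side by side, which makes the group comparison easy to read off.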

5. Summarizing by group

We can estimate the average age of smokers and non-smokers by replacing DaysPhysHlthBad with Age in svyby(). It turns out that the smokers are, on average, younger than the non-smokers. Therefore, if we controlled for age, we might see an even bigger difference in health quality between the smokers and non-smokers. But what other confounding variables might we be missing? Moral of the story: always think about the origin of your data when you are drawing conclusions! I should also note that, while we couldn't conclude a causal link here, many studies have shown that smoking has a negative impact on health.
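Checking a possible confounder only requires changing the formula. The sketch below swaps in Age on a hypothetical toy sample (`toy`, `toy_design`, and all ages and weights are made up, not NHANES values), so the direction of the difference it prints says nothing about real smokers.

```r
library(survey)

# Hypothetical toy sample (NOT NHANES): 2 strata, 2 PSUs per stratum,
# 4 people per PSU, alternating non-smokers and smokers.
toy <- data.frame(
  stratum  = rep(c(1, 2), each = 8),
  psu      = rep(c(1, 2, 1, 2), each = 4),
  wt       = c(10, 12, 8, 10, 15, 15, 20, 10, 5, 5, 10, 10, 25, 20, 15, 10),
  SmokeNow = factor(rep(c("No", "Yes"), times = 8), levels = c("No", "Yes")),
  Age      = c(61, 34, 55, 29, 70, 41, 48, 38, 66, 45, 52, 31, 59, 27, 63, 36)
)

toy_design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                        data = toy, nest = TRUE)

# Same call as for the health variable, with Age as the variable of interest.
by_age <- svyby(formula = ~Age, by = ~SmokeNow, design = toy_design,
                FUN = svymean, na.rm = TRUE, keep.names = FALSE)

print(by_age)
```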

6. Let's practice!

Now it's your turn to do some summarizing!
