
Inference for categorical variables

1. Inference for categorical variables

2. NHANES: Race and Diabetes

As we saw in the last video, there appears to be a relationship between a person's race and the likelihood that they have been diagnosed with diabetes. We concluded this by observing in our plot that the prevalence rates differ by group. But remember, these estimates are based only on a random sample of the US population, so the association we are observing could be due purely to random chance. We now need to run a formal statistical test to determine whether we have evidence of an association between race and diabetes.

3. Inference: Chi-square Test

Because we have two categorical variables, we will run a chi-square test for association. Remember that when using hypothesis tests, we first define a null hypothesis, which often contains a null claim. For categorical variables, this claim would be that there's no relationship, and for our example, we'd say that the prevalence of diabetes is not associated with race. We also define an alternative hypothesis, which contains the competing, and usually more interesting, claim. In our case, the claim is that the prevalence of diabetes is associated with race.

We then determine how consistent our sample results are with the null hypothesis. If they are fairly consistent, then it is plausible the variables aren't related and we don't have evidence for the competing claim. But if our sample results are just too inconsistent with the null, then we suspect the null is wrong and we have evidence for the alternative, that is, the variables are related.

We obtain our measure of consistency with the null by running svychisq() on our two variables and specifying our design. The output contains two important pieces: the test statistic, labeled X-squared, and the p-value. The test statistic summarizes the discrepancy between our sample results and the sample results that would have been most consistent with the null. The larger the test statistic, the greater the discrepancy. But how large must it be for us to call our sample results too inconsistent? This is where the p-value comes in! The p-value tells us how likely it is that we would have observed that test statistic (or an even more unusual one) if there truly were no relationship between our two variables, Race1 and Diabetes. This probability can be computed because the distribution of the test statistic can be approximated by a known distribution, a chi-squared distribution. And that is where the name of this test comes from!

Since our p-value is so small here, it tells us that it would be very unlikely to see the differences in diabetes rates that we observed if the rate were actually the same across all racial groups in the US. Therefore, we have evidence that diabetes prevalence does vary by race.
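
To make the steps above concrete, here is a minimal sketch of how the test might be run in R. It assumes the survey and NHANES packages are installed and that the data come from the NHANESraw data frame. The design object's name (NHANES_design), the halved weight column (WTMEC4YR), and the choice of statistic = "Chisq" (which labels the test statistic X-squared in the output) are assumptions made for illustration, not details confirmed in this video.

```r
# Sketch only: assumes the survey and NHANES packages are installed.
library(survey)
library(NHANES)

# Assumption: NHANESraw combines two 2-year cycles, so the 2-year exam weight
# is halved to approximate a 4-year weight. WTMEC4YR is a name chosen here.
nhanes <- NHANESraw
nhanes$WTMEC4YR <- nhanes$WTMEC2YR / 2

# Assumption: the design specification and its name are illustrative.
NHANES_design <- svydesign(
  data    = nhanes,
  ids     = ~SDMVPSU,    # primary sampling units
  strata  = ~SDMVSTRA,   # sampling strata
  weights = ~WTMEC4YR,   # survey weights
  nest    = TRUE         # PSU ids are nested within strata
)

# Weighted two-way table of the two categorical variables being tested.
print(svytable(~Race1 + Diabetes, design = NHANES_design))

# Chi-square test for association that accounts for the survey design.
result <- svychisq(~Race1 + Diabetes,
                   design    = NHANES_design,
                   statistic = "Chisq")
print(result)

# The two pieces discussed above can also be extracted directly.
result$statistic  # test statistic (X-squared)
result$p.value    # p-value
```

A small p-value from result$p.value would be read as described above: the observed differences in prevalence would be unlikely if there were truly no association between Race1 and Diabetes.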

4. Let's practice!

Now that we have reviewed hypothesis testing, it's your turn to test for an association between depression and perceived health. Then you will wrap up the chapter with one more exercise where you tie all the pieces together.