NHANES Data Cleaning
During data cleaning, we discovered that no one under the age of 16 was given the treatment. Recall that we're treating the variable that indicates whether a doctor has ever advised a patient to reduce fat or calories in their diet as purposeful nutrition counseling, our treatment. Let's keep only patients who are older than 16 in the dataset.
You may also have noticed that the default settings in ggplot2 delete any observations with a missing dependent variable, in this case, body weight. One option for dealing with the missing weights, imputation, can be implemented using the simputation package. Imputation is a technique for dealing with missing values in which you replace them either with a summary statistic, like the mean or median, or with a value predicted by a model.
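To make the summary-statistic idea concrete, here is a minimal base R sketch. The weights vector is made up for illustration and is not taken from NHANES.

```r
# Made-up body weights with two missing values (illustrative only)
weights <- c(62.5, NA, 80.1, 71.3, NA, 90.0)

# Median of the observed (non-missing) values
observed_median <- median(weights, na.rm = TRUE)

# Replace each NA with that median; observed values are left unchanged
weights_imputed <- ifelse(is.na(weights), observed_median, weights)

weights_imputed
```

Every missing entry receives the same value here because no grouping is applied. Imputing within groups, as the exercise does, simply repeats this calculation separately for each group.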
We'll use impute_median(), which takes a dataset and the variable to impute or formula to impute by as arguments. For example, impute_median(ToothGrowth, len ~ dose) would fill in any missing values in the variable len with the median value for len by dose. So, if a guinea pig who received a dose of 2.0 had a missing value for the len variable, it would be filled in with the median len for those guinea pigs with a dose of 2.0.
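The ToothGrowth example can be tried out roughly as follows. The built-in ToothGrowth data has no missing values, so this sketch blanks out one len value first; that step is an assumption added purely for illustration.

```r
library(simputation)

# Copy the built-in data and remove one len value in the dose 2.0 group
tg <- ToothGrowth
missing_row <- which(tg$dose == 2)[1]
tg$len[missing_row] <- NA

# Impute len using the median within each dose group
tg_imputed <- impute_median(tg, len ~ dose)

# Compare the filled-in value with the median of the observed len at dose 2.0
tg_imputed$len[missing_row]
median(tg$len[tg$dose == 2], na.rm = TRUE)
```

If the imputation behaves as described above, the two printed values should agree.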
This exercise is part of the course Experimental Design in R.
Exercise instructions
- Create nhanes_filter by using filter() to keep anyone older than 16 in the dataset, not including those who are 16. Age is stored in the ridageyr variable.
- Load simputation. Use impute_median() to fill in the missing observations of bmxwt in nhanes_filter, grouping by riagendr.
- Recode the nhanes_final$mcq365d variable by setting any observations with a value of 9 to 2 instead. Verify the recoding worked with count().
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Filter to keep only those older than 16
nhanes_filter <- ___ %>% filter(___)
# Load simputation & impute bmxwt by riagendr
___
nhanes_final <- impute_median(___, ___)
# Recode mcq365d with recode() & examine with count()
nhanes_final$mcq365d <- recode(nhanes_final$mcq365d,
                               `1` = 1,
                               `2` = 2,
                               `9` = ___)
___ %>% ___
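As a rough guide to how the three steps fit together, here is a sketch run on a small made-up data frame. The name nhanes_toy and all of its values are hypothetical stand-ins for the course dataset, which is not reproduced here; only the column names come from the instructions above.

```r
library(dplyr)
library(simputation)

# Hypothetical stand-in for the NHANES data (values are made up)
nhanes_toy <- data.frame(
  ridageyr = c(12, 16, 17, 30, 45, 60),   # age in years
  riagendr = c(1, 2, 1, 2, 1, 2),         # gender code
  bmxwt    = c(40, 55, NA, 70, 80, NA),   # body weight, with missing values
  mcq365d  = c(2, 2, 1, 9, 2, 1)          # treatment indicator
)

# Keep only rows with age strictly greater than 16
nhanes_filter <- nhanes_toy %>% filter(ridageyr > 16)

# Fill missing bmxwt with the median weight within each riagendr group
nhanes_final <- impute_median(nhanes_filter, bmxwt ~ riagendr)

# Recode 9 to 2, leaving 1 and 2 as they are
nhanes_final$mcq365d <- recode(nhanes_final$mcq365d,
                               `1` = 1,
                               `2` = 2,
                               `9` = 2)

# Check that no 9s remain
nhanes_final %>% count(mcq365d)
```

The strict inequality matters: with ridageyr >= 16 the 16-year-old row would stay in, which the instructions rule out.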