
Toying with OLS III: Model Selection

NixSplash, a water conservationist group in Utah, wants to make a targeted ad campaign to encourage high water consumers to reduce their water usage. To determine who to target with their ad campaign, NixSplash conducts a survey across a random sample of the population. However, the intern who put together NixSplash's survey asked a bunch of odd questions. Use NixSplash's survey dataset, Survey, to generate the best model for predicting a household's water usage.

Instructions

  • 1) Examine the data from the survey
  • 2) Build a regression model that includes all statistically significant predictors of a household's water usage (Water).
  • 3) Build a regression model that includes all intuitive predictors of a household's water usage (Water).
  • 4) Decide which way of modeling is better.

This exercise is part of the course

Causal Inference with R - Regression

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Note: Here is a data dictionary about the variables:
#    Water refers to the number of gallons (in hundreds) that the respondent has used in the past year
#    YardSize refers to the number of acres that the respondent owns
#    Rainfall refers to the number of inches of rain that fell on the respondent's property in the past year
#    NSprinkler refers to whether the respondent's neighbor runs a sprinkler system on a daily basis
#    Dunking refers to whether the respondent lives within walking distance of a Dunking Donuts
#    Massage refers to whether the respondent has ever gotten a massage
#    Coffee refers to whether the respondent drinks coffee on a daily basis. 

# 1) As usual, let's first examine the dataset. Start by running the summary() command on the dataframe `Survey`:

    summary()
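
# A minimal sketch of what summary() reports, using a small made-up data frame
# (`toy` and its values are hypothetical, not the NixSplash survey). Numeric
# columns get quartiles and a mean; logical columns get counts of TRUE and FALSE.

```r
# Hypothetical stand-in for a survey data frame; not the real Survey data.
toy <- data.frame(
  Water    = c(820, 640, 910, 700, 1050),      # hundreds of gallons per year
  YardSize = c(0.25, 0.10, 0.40, 0.15, 0.60),  # acres
  Coffee   = c(TRUE, FALSE, TRUE, TRUE, FALSE) # drinks coffee daily?
)

# summary() on a data frame gives one column of output per variable.
toy_summary <- summary(toy)
print(toy_summary)

# str() complements summary() by showing each column's type.
str(toy)
```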


# 2) Some of the survey questions may seem irrelevant to how much water a household uses, but let's find out for certain. Start by fitting an OLS regression of water usage on all of the other variables in the dataset.
  
# Note: We can get the summary statistics for a generalized linear model by passing the glm() call as the argument to summary(), like so: summary(glm(Water ~ YardSize, data=Survey)). With its default settings (a gaussian family), glm() fits the same model as OLS. Now do that for yourself, but include all of the variables in the data frame:
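
# In an R formula, a `.` on the right-hand side stands for "every other column
# in the data", which saves typing out each predictor. A sketch on simulated
# data (`sim` and the coefficients used to generate it are invented for
# illustration):

```r
set.seed(1)
n <- 200

# Simulated stand-in for a survey; every value here is made up.
sim <- data.frame(
  YardSize = runif(n, 0.05, 1),
  Rainfall = rnorm(n, mean = 15, sd = 3),
  Coffee   = rbinom(n, 1, 0.5)
)
sim$Water <- 500 + 300 * sim$YardSize - 10 * sim$Rainfall + rnorm(n, sd = 40)

# `Water ~ .` expands to `Water ~ YardSize + Rainfall + Coffee`.
fit_dot      <- glm(Water ~ ., data = sim)
fit_explicit <- glm(Water ~ YardSize + Rainfall + Coffee, data = sim)

# Both calls describe the same model, so the coefficients agree.
all.equal(coef(fit_dot), coef(fit_explicit))

summary(fit_dot)
```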



# If you did this correctly, you should see a statistically significant effect from all variables except Dunking and Massage.


# 3) Since the variables Dunking and Massage have no significant effect on Water, let's try removing them from our model and see what changes. Enter the summary(glm()) code with these variables omitted:
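
# One way to drop predictors without retyping the whole formula is update(),
# where `. ~ . - X` means "the same model, minus X". A sketch on simulated data
# (`sim`, `Noise1`, and `Noise2` are hypothetical names, not survey variables):

```r
set.seed(2)
n <- 200

# Simulated data: only YardSize is related to Water.
sim <- data.frame(
  YardSize = runif(n, 0.05, 1),
  Noise1   = rbinom(n, 1, 0.5),  # unrelated to Water by construction
  Noise2   = rbinom(n, 1, 0.5)   # unrelated to Water by construction
)
sim$Water <- 500 + 300 * sim$YardSize + rnorm(n, sd = 40)

fit_full <- glm(Water ~ YardSize + Noise1 + Noise2, data = sim)

# Refit the model with the two noise variables removed.
fit_reduced <- update(fit_full, . ~ . - Noise1 - Noise2)

formula(fit_reduced)   # Water ~ YardSize
summary(fit_reduced)
```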



# So which model is better: one that predicts water usage with all of our available variables, or one that predicts water usage without Dunking or Massage? Statisticians have developed a variety of tools to measure how well a model specification fits the data (i.e. a model's "goodness of fit"). The output of summary(glm()) reports one such measure, the AIC (Akaike information criterion). We won't work through the math here, but the AIC rewards a close fit to the data and penalizes each additional parameter, so in practice it can be used to compare two models fitted to the same data. The model with the lower AIC is considered the better of the two.
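
# For reference, AIC = 2k - 2*log(L), where k is the number of estimated
# parameters and L is the maximized likelihood, so an extra predictor has to
# improve the fit enough to offset its penalty. A sketch on simulated data
# (`sim` and `Noise` are hypothetical) using R's AIC() and logLik():

```r
set.seed(3)
n <- 200

# Simulated data: Noise is unrelated to Water by construction.
sim <- data.frame(YardSize = runif(n, 0.05, 1), Noise = rbinom(n, 1, 0.5))
sim$Water <- 500 + 300 * sim$YardSize + rnorm(n, sd = 40)

fit_full    <- glm(Water ~ YardSize + Noise, data = sim)
fit_reduced <- glm(Water ~ YardSize, data = sim)

# Side-by-side table with one row per model (columns: df and AIC).
AIC(fit_full, fit_reduced)

# The same number rebuilt by hand from the log-likelihood.
ll <- logLik(fit_full)
k  <- attr(ll, "df")   # regression coefficients plus the residual variance
manual_aic <- 2 * k - 2 * as.numeric(ll)
all.equal(manual_aic, AIC(fit_full))
```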


# 4) In the above example, the AIC of our second model is lower than that of our first model, but only very slightly (a rule of thumb is that a difference of 5 is considered substantial). This suggests that the two models fit the data more or less equally well. So which should we choose? Should NixSplash incorporate some of our unintuitive variables into their model and ad campaign? In other words, should we include all variables in our model even if it is unclear why proximity to a Dunking Donuts, having had a massage, or daily coffee drinking would affect water usage? Set Solution4 to "Yes" or "No".

      Solution4<-""
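
# The rule of thumb from step 4 can be written as a small helper. This is only
# a sketch: `compare_aic` and its `threshold` argument are hypothetical names,
# and the default of 5 is the rule of thumb quoted in this exercise rather than
# a universal cutoff.

```r
# Reports which model the AIC favors, or "similar" when the gap is below `threshold`.
compare_aic <- function(fit_a, fit_b, threshold = 5) {
  delta <- AIC(fit_a) - AIC(fit_b)
  if (abs(delta) < threshold) {
    "similar"
  } else if (delta < 0) {
    "first"    # fit_a has the substantially lower AIC
  } else {
    "second"   # fit_b has the substantially lower AIC
  }
}

# Simulated example (all values made up): YardSize drives Water strongly,
# so a model without it should have a much higher AIC.
set.seed(4)
n <- 200
sim <- data.frame(YardSize = runif(n, 0.05, 1))
sim$Water <- 500 + 300 * sim$YardSize + rnorm(n, sd = 40)

with_yard    <- glm(Water ~ YardSize, data = sim)
without_yard <- glm(Water ~ 1, data = sim)

compare_aic(with_yard, without_yard)
compare_aic(with_yard, with_yard)
```

# When the helper returns "similar", the AIC alone does not settle the choice,
# and judgment about which predictors make sense has to do the rest.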
      