
Exercise 17 - Comparing Within-Poll and Between-Poll Variability

We compute a statistic called the t-statistic by dividing our estimate of \(b_2-b_1\) by its estimated standard error:

$$ \frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{s_2^2/N_2 + s_1^2/N_1}} $$

Later we will learn of another approximation for the distribution of this statistic for values of \(N_2\) and \(N_1\) that aren't large enough for the CLT.
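As a minimal sketch of the formula above, the t-statistic can be computed directly in base R. The vectors `y1` and `y2` are made-up spreads for two hypothetical pollsters, not values from the polling data, and the variable names are chosen only for this illustration.

```r
# Hypothetical spreads from two pollsters (made-up numbers, for illustration only)
y1 <- c(0.04, 0.06, 0.05, 0.03, 0.07)
y2 <- c(0.01, 0.02, 0.00, 0.03, 0.02)

# Estimate of b_2 - b_1: the difference of the sample averages
estimate <- mean(y2) - mean(y1)

# Estimated standard error: sqrt(s_2^2/N_2 + s_1^2/N_1)
se_hat <- sqrt(var(y2) / length(y2) + var(y1) / length(y1))

# The t-statistic is the estimate divided by its estimated standard error
t_stat <- estimate / se_hat
t_stat
```

The same quantity is reported as the `statistic` element of `t.test(y2, y1)`, which by default does not assume equal variances, so it can serve as a check on the hand computation.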

Note that our data has more than two pollsters. We can also test for pollster effect using all pollsters, not just two. The idea is to compare the variability across polls to variability within polls. We can construct statistics to test for effects and approximate their distribution. The area of statistics that does this is called Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to answer questions such as: is there a pollster effect?
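Although ANOVA is outside the scope of this exercise, the idea of comparing across-poll variability to within-poll variability can be illustrated with a rough sketch. The data frame `toy` below contains made-up spreads for three hypothetical pollsters. The ratio computed by hand is the usual one-way ANOVA F statistic, and base R's `aov` is used only as a cross-check.

```r
# Hypothetical data: 3 pollsters with 4 polls each (made-up numbers)
toy <- data.frame(
  pollster = rep(c("A", "B", "C"), each = 4),
  spread = c(0.04, 0.05, 0.06, 0.05,
             0.02, 0.01, 0.03, 0.02,
             0.05, 0.07, 0.06, 0.06)
)

# Per-pollster averages and sample sizes, plus the overall average
group_means <- tapply(toy$spread, toy$pollster, mean)
group_n <- tapply(toy$spread, toy$pollster, length)
grand_mean <- mean(toy$spread)

# Variability across pollster averages (between) and within each pollster
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within <- sum((toy$spread - group_means[as.character(toy$pollster)])^2)

# F statistic: between mean square divided by within mean square
k <- length(group_means)
n <- nrow(toy)
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Cross-check against the built-in one-way ANOVA
fit <- aov(spread ~ pollster, data = toy)
summary(fit)
```

A large ratio suggests that the pollster averages differ by more than the within-pollster noise would explain, which is the informal comparison the exercise asks for.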

Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.

This exercise is part of the course

HarvardX Data Science Module 4 - Inference and Modeling


Exercise instructions

  • Group the polls data by pollster.
  • Summarize the average and standard deviation of the spreads for each pollster.
  • Create an object called var that contains three columns: pollster, mean spread, and standard deviation.
  • Be sure to name the column for mean avg and the column for standard deviation s.
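The grouping and summarizing pattern described in these instructions can be sketched on a small made-up data frame. The names `toy_polls` and `toy_summary` are hypothetical and chosen for illustration. The sketch assumes the dplyr package is installed, and it is not the exercise solution itself, which should operate on `polls`.

```r
library(dplyr)

# Hypothetical polls from two pollsters (made-up numbers, for illustration only)
toy_polls <- data.frame(
  pollster = rep(c("A", "B"), each = 3),
  spread = c(0.04, 0.05, 0.06, 0.01, 0.02, 0.03)
)

# One row per pollster: avg is the mean spread, s is the standard deviation
toy_summary <- toy_polls %>%
  group_by(pollster) %>%
  summarize(avg = mean(spread), s = sd(spread))

toy_summary
```

Comparing how much the `avg` column varies from row to row against the typical size of `s` gives a first impression of whether pollsters differ by more than their own poll-to-poll noise.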

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

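# Load dplyr (for %>%, filter, group_by, mutate) and dslabs (for polls_us_election_2016)
library(dplyr)
library(dslabs)

# Execute the following lines of code to filter the polling data and calculate the spread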
polls <- polls_us_election_2016 %>% 
  filter(enddate >= "2016-10-15" &
           state == "U.S.") %>%
  group_by(pollster) %>%
  filter(n() >= 5) %>% 
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
  ungroup()

# Create an object called `var` that contains columns for the pollster, mean spread, and standard deviation. Print the contents of this object to the console.

