1. Comparing means with a t-test
In this video we return to the American Community Survey data and compare average pay for citizens and non-citizens.
2. A (more) standard measure of pay
One obvious measure of pay is annual income. However, instead of comparing annual incomes directly, we should adjust for the number of hours worked and compare average hourly rates. So first, we create a new variable called hrly_rate.
To do so, we'll make the assumption that there are 52 weeks in a year,
and calculate the hourly rate as annual income divided by the product of weekly hours worked and 52.
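As a minimal sketch of that calculation, here is the hourly rate computed on a tiny made-up data frame. The data values and the column names income, hrs_work, and citizen are assumptions for illustration, not the actual survey data.

```r
# Hypothetical toy data: column names and values are assumptions,
# not the real American Community Survey records.
toy <- data.frame(
  citizen  = c("yes", "yes", "no", "no"),
  income   = c(52000, 31200, 41600, 20800),  # annual income in dollars
  hrs_work = c(40, 30, 40, 20)               # usual hours worked per week
)

# Hourly rate = annual income / (weekly hours * 52 weeks in a year)
toy$hrly_rate <- toy$income / (toy$hrs_work * 52)

print(toy)
```

Note the parentheses: dividing by hrs_work and then multiplying by 52 would give a very different (and wrong) number.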
3. Research question and hypotheses
Our research question is "Do the data provide convincing evidence of a difference between the average hourly rate of citizens and non-citizens in the US?". We use a hypothesis test to answer this question.
If mu is defined as the average hourly pay in the population,
the null hypothesis states that there is no difference between the average hourly rates of citizens and non-citizens, and
the alternative hypothesis follows from the research question, stating that there is a difference between the average hourly rates of citizens and non-citizens.
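Written in symbols, the two hypotheses might look like this. The subscript labels are just one possible naming choice.

```latex
H_0:\ \mu_{\text{citizen}} = \mu_{\text{non-citizen}}
\qquad
H_A:\ \mu_{\text{citizen}} \neq \mu_{\text{non-citizen}}
```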
4. Summary statistics
Let's take a look at the relevant summary statistics. To do so, we group the data by citizenship status and then calculate the mean, standard deviation, and sample size for each group. The observed average hourly rates for citizens and non-citizens are indeed different, but we want to know how they compare in the population, not just in this sample.
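Here is a rough sketch of that grouping step in base R, again on made-up numbers. The data frame, its values, and the output column names x_bar, s, and n are assumptions for illustration.

```r
# Hypothetical toy data: values are made up for illustration.
toy <- data.frame(
  citizen   = c("no", "no", "no", "yes", "yes", "yes", "yes"),
  hrly_rate = c(18, 22, 20, 26, 16, 30, 12)
)

# Split hourly rates by citizenship status, then summarise each group.
by_group <- split(toy$hrly_rate, toy$citizen)
summ <- data.frame(
  citizen = names(by_group),
  x_bar   = sapply(by_group, mean),    # group mean
  s       = sapply(by_group, sd),      # group standard deviation
  n       = sapply(by_group, length)   # group sample size
)

print(summ)
```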
5. Conducting the test
We can use the t.test function to conduct this test. The first argument is a formula of the form y ~ x, read as y versus x; in this case, hourly rate versus citizenship status.
Next we specify the data frame these variables live in, and then the null value and the alternative hypothesis.
The null hypothesis says the two group means are equal to each other.
Or, equivalently, that the difference between the two group means is 0, hence the null value of 0 in the function call.
Since we're looking for a difference, the alternative is two.sided.
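Putting those pieces together, a call might look like the sketch below. It runs on simulated data, so the result will not match the survey. Note also that base R's t.test names the null-value argument mu, which is what this sketch uses.

```r
# Simulated, right-skewed hourly rates: purely hypothetical data.
set.seed(1)
toy <- data.frame(
  citizen   = rep(c("no", "yes"), times = c(30, 60)),
  hrly_rate = rlnorm(90, meanlog = 3, sdlog = 0.5)
)

# Formula: response ~ grouping variable.
# mu = 0 sets the null value for the difference in means;
# alternative = "two.sided" tests for a difference in either direction.
res <- t.test(hrly_rate ~ citizen, data = toy,
              mu = 0, alternative = "two.sided")

print(res)
```

By default, t.test does not assume equal variances in the two groups, so this is Welch's version of the test.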
6. Conducting the test
The p-value of the test is 0.5637, which is higher than any reasonable significance level. Hence, we fail to reject the null hypothesis and conclude that the data do not provide convincing evidence of a difference between the average hourly rate of citizens and non-citizens in the US.
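The decision rule itself is a single comparison. The sketch below uses the p-value quoted above and assumes a 5% significance level, which is a common choice but an assumption here.

```r
p_value <- 0.5637  # p-value reported for this test
alpha   <- 0.05    # assumed significance level

# Reject the null hypothesis only if the p-value falls below alpha.
reject_null <- p_value < alpha

if (reject_null) {
  message("Reject the null hypothesis")
} else {
  message("Fail to reject the null hypothesis")
}
```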
7. Conditions
Before we wrap up our discussion of comparing means across two groups, let's also review conditions for this test.
First, we need independence. But independence of what?
Observations in each sample should be independent of each other. We discussed earlier that this is a difficult condition to check directly. However, if the study employs random sampling and/or random assignment, and, for studies that sample without replacement (which covers almost all observational studies like polls and surveys), each sample is less than 10% of its respective population, we can be fairly certain that the observations in each of the samples are independent of each other. The American Community Survey employs random sampling, and our sample sizes are definitely less than 10% of all citizens and non-citizens in the US.
Also, observations across the two samples should be independent of each other; in other words, the data should not be paired.
Lastly, if the population distributions are skewed, we need larger sample sizes. A side-by-side box plot of the two distributions shows a fair amount of skew in each sample, but the sample sizes are pretty large as well (58 and 901). Hence we should be able to assume that the sampling distribution of the difference in sample means is nearly normal.
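One way to check skew and sample size together is sketched below. The group sizes match the ones quoted above, but the hourly rates are simulated from a right-skewed distribution, so the plot only illustrates the idea.

```r
# Simulated data: group sizes follow the transcript (58 and 901),
# but the hourly rates are hypothetical draws from a skewed distribution.
set.seed(42)
toy <- data.frame(
  citizen   = rep(c("no", "yes"), times = c(58, 901)),
  hrly_rate = rlnorm(58 + 901, meanlog = 3, sdlog = 0.6)
)

# Sample size in each group
n_by_group <- table(toy$citizen)
print(n_by_group)

# Side-by-side box plot; long upper whiskers and many high outliers
# suggest right skew.
bp <- boxplot(hrly_rate ~ citizen, data = toy,
              xlab = "Citizen", ylab = "Hourly rate")
```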
8. Let's practice!
Time to put this into practice.