Performing t-tests

1. Performing t-tests

In Chapter one, you calculated z-scores, a type of test statistic.

2. Two-sample problems

Here, we'll look at the related problem of comparing sample statistics across groups of a variable. In the Stack Overflow dataset, converted_comp is a numerical variable of annual compensation. age_first_code_cut is a categorical variable with two levels, child and adult, which describe when the user started programming. We can ask questions about differences in compensation across the two age groups. That is, do users who first programmed as a child tend to be paid more than those who started as adults?

3. Hypotheses

The null hypothesis is that the population mean for the two groups is the same, and the alternative hypothesis is that the population mean for users who started coding as children is greater than for users who started coding as adults. We can write these hypotheses using equations. Mu represents an unknown population mean, and we use subscripts to denote which group the population mean belongs to. An alternate way of writing the equations is to compare the difference in population means to zero.
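Written out, the hypotheses might look like the following. The subscript labels child and adult are just one notation choice for this sketch, taken from the two levels of age_first_code_cut.

```latex
\[
\begin{aligned}
H_0 &: \mu_{\text{child}} = \mu_{\text{adult}} \\
H_A &: \mu_{\text{child}} > \mu_{\text{adult}}
\end{aligned}
\]

% Equivalent form: compare the difference in population means to zero.
\[
\begin{aligned}
H_0 &: \mu_{\text{child}} - \mu_{\text{adult}} = 0 \\
H_A &: \mu_{\text{child}} - \mu_{\text{adult}} > 0
\end{aligned}
\]
```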

4. Calculating groupwise summary statistics

Calculating the summary statistics for each group is straightforward. You start with the sample, group by the categorical variable, then summarize the numeric variable. A dplyr way of doing this is shown, but there are many possibilities. Pause the video for a moment and think about different ways of solving this. Here, the child programmers have a mean compensation of one hundred and thirty-eight thousand dollars, compared to one hundred and eleven thousand for adult programmers. Is that difference statistically significant, or could it be explained by sampling variability?
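As a minimal sketch of the group-by-then-summarize pattern, here is one way it could look in dplyr. The data frame name stack_overflow and its six rows are hypothetical toy values, not the survey data; only the two column names come from the lesson.

```r
library(dplyr)

# Hypothetical toy data standing in for the survey sample.
# Only the column names are taken from the lesson.
stack_overflow <- tibble(
  age_first_code_cut = c("child", "child", "child", "adult", "adult", "adult"),
  converted_comp = c(120000, 150000, 90000, 80000, 110000, 95000)
)

# Group by the categorical variable, then summarize the numeric variable.
group_means <- stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarize(mean_compensation = mean(converted_comp))

print(group_means)
```

One of the other possibilities is base R: tapply(stack_overflow$converted_comp, stack_overflow$age_first_code_cut, mean) gives the same group means as a named vector.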

5. Test statistics

Although we don't know the population mean, we estimate it using the sample mean. x-bar is used to denote a sample mean. Then we use subscripts to denote which group a sample mean corresponds to. The difference between these two sample means is called the test statistic for the hypothesis test. The z-scores you saw in Chapter one are another type of test statistic.

6. Standardizing the test statistic

z-scores are calculated by taking the sample statistic, subtracting its hypothesized value (the population parameter of interest), then dividing by the standard error. In the two-sample case, the test statistic, denoted t, uses a similar equation. You take the difference between the sample statistics for the two groups, subtract the hypothesized difference between the two population parameters, then divide by the standard error.
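Side by side, the two standardizations can be sketched like this. The symbol SE for the standard error and the child and adult subscripts are notation assumed for this illustration.

```latex
% z-score: one sample statistic, standardized.
\[
z = \frac{\text{sample statistic} - \text{population parameter}}{\text{standard error}}
\]

% Two-sample t: a difference of sample means, standardized.
\[
t = \frac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})
          - (\mu_{\text{child}} - \mu_{\text{adult}})}
         {SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}
\]
```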

7. Standard error

To calculate the standard error, needed for the denominator of the test statistic equation, bootstrapping tends to be a good option. However, there is an easier way to approximate it. You calculate the standard deviation of the numeric variable for each group in the sample, and the number of observations for each group. Then it's just some arithmetic.
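The arithmetic in question is the usual approximation for the standard error of a difference in means. Here s is a group's sample standard deviation and n is its number of observations; the subscript labels are again just this sketch's notation.

```latex
\[
SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})
  \approx \sqrt{\frac{s_{\text{child}}^{2}}{n_{\text{child}}}
              + \frac{s_{\text{adult}}^{2}}{n_{\text{adult}}}}
\]
```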

8. Assuming the null hypothesis is true

Here's the test statistic equation again. If we assume that the null hypothesis is true, there's a simplification we can make. The null hypothesis assumes that the population means are equal, so their hypothesized difference is zero and that term in the numerator disappears. Using the approximation for the standard error, we now have a way of calculating the test statistic using only calculations on the sample dataset. We need the mean, standard deviation, and number of observations for each group. This is another group-by and summarize combination.
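Putting the simplification and the standard error approximation together gives something like the following, using the same assumed child and adult subscripts.

```latex
% Under H_0, \mu_child - \mu_adult = 0, so that term drops out of the numerator.
\[
t = \frac{\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}}
         {\sqrt{\dfrac{s_{\text{child}}^{2}}{n_{\text{child}}}
              + \dfrac{s_{\text{adult}}^{2}}{n_{\text{adult}}}}}
\]
```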

9. Calculating the test statistic

Annoyingly, it requires some grim data manipulation code to calculate the test statistic from this data frame. Since this isn't a course on data manipulation, let's do the calculation with separate variables. The numerator is a simple subtraction, and the denominator is like a weighted hypotenuse. The t-statistic is two-point-four. Just as with z-scores, we can't do anything with that number yet; you'll need to wait for the next video.
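For reference, here is a rough sketch of the whole calculation on a toy sample. The data frame name, its values, and every variable name below are hypothetical choices for illustration, so the result will not match the t-statistic quoted above.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the lesson.
stack_overflow <- tibble(
  age_first_code_cut = c("child", "child", "child", "adult", "adult", "adult"),
  converted_comp = c(120000, 150000, 90000, 80000, 110000, 95000)
)

# Mean, standard deviation, and number of observations for each group.
group_stats <- stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarize(
    xbar = mean(converted_comp),
    s = sd(converted_comp),
    n_obs = n()
  )

# Pull each group's row out so the arithmetic uses separate variables.
child <- filter(group_stats, age_first_code_cut == "child")
adult <- filter(group_stats, age_first_code_cut == "adult")

# Numerator: a simple subtraction of the sample means.
numerator <- child$xbar - adult$xbar

# Denominator: the approximate standard error (the "weighted hypotenuse").
denominator <- sqrt(child$s^2 / child$n_obs + adult$s^2 / adult$n_obs)

t_stat <- numerator / denominator
print(t_stat)
```

Keeping the numerator and denominator as separate variables mirrors the two parts of the equation, which makes each step easy to check by hand.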

10. Let's practice!

In the meantime, it's time to calculate some test statistics.