Calculating p-values from t-statistics

1. Calculating p-values from t-statistics

In the previous lesson, you calculated t, the test statistic.

2. t-distributions

The test statistic, t, follows a t-distribution. t-distributions have a parameter called the degrees of freedom, or df for short. Here's a line plot of the PDF of a t-distribution with one degree of freedom in yellow, and the PDF of a normal distribution in blue dashes. Notice that the t-distribution for small degrees of freedom has fatter tails than the normal distribution, but otherwise they look similar.
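The exact plotting code isn't shown in the video, but a minimal base-R sketch of that comparison could look like this; dt() gives the t-distribution PDF and dnorm() the normal PDF.

```r
# Compare the PDF of a t-distribution (1 degree of freedom) with the normal PDF
x <- seq(-4, 4, length.out = 200)

plot(x, dt(x, df = 1), type = "l", col = "gold",
     xlab = "x", ylab = "PDF", main = "t-distribution (df = 1) vs. normal")
lines(x, dnorm(x), col = "blue", lty = 2)
legend("topright", legend = c("t, df = 1", "normal"),
       col = c("gold", "blue"), lty = c(1, 2))
```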

3. Degrees of freedom

As you increase the degrees of freedom, the t-distribution gets closer to the normal distribution. In fact, a normal distribution is a t-distribution with infinite degrees of freedom. Degrees of freedom are defined as the maximum number of logically independent values in the data sample. That's a fairly tricky concept, so let's try an example.
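You can see that convergence numerically; this small check (my own illustration, not from the video) evaluates both PDFs at the same point.

```r
# As the degrees of freedom grow, the t PDF approaches the normal PDF
dt(1.5, df = 1)     # fat tails: noticeably different from the normal density
dt(1.5, df = 30)    # already close to the normal
dt(1.5, df = 1e6)   # effectively indistinguishable
dnorm(1.5)          # the normal density at the same point
```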

4. Calculating degrees of freedom

Suppose your dataset has 5 independent observations. Four of the values are 2, 6, 8, and 5, and you also know the sample mean is 5. The last value is no longer free to vary; it must be 4. Even though all five observations were collected independently, knowing the sample mean removes one degree of freedom, so you only have 4 degrees of freedom. In our two-sample case, the degrees of freedom equal the total number of observations minus two, because we know two sample statistics: the two sample means.
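Here's that arithmetic as a short sketch; the two group sizes at the end are placeholders, not the real Stack Overflow counts.

```r
# Four known observations plus a known sample mean pin down the fifth value
x_partial <- c(2, 6, 8, 5)
x_bar <- 5
n <- 5

x_last <- n * x_bar - sum(x_partial)
x_last  # 4

# Two-sample case: total observations minus two (one per sample mean)
n_child <- 100  # placeholder group sizes
n_adult <- 150
degrees_of_freedom <- n_child + n_adult - 2
```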

5. Hypotheses

Recall the hypotheses for our Stack Overflow question about compensation for the two age groups. Since this is a "greater than" alternative hypothesis, we need a right-tailed test.
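Written out (the group labels are paraphrased from the course example rather than quoted exactly), the hypotheses are:

```latex
H_0\colon \mu_{\text{child}} - \mu_{\text{adult}} = 0
\qquad \text{vs.} \qquad
H_A\colon \mu_{\text{child}} - \mu_{\text{adult}} > 0
```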

6. Significance level

We're going to calculate a p-value in a moment. We need to decide on a significance level before we do that. There are several possibilities; I'm going to use point-one. That means that if the p-value is less than point-one, we reject the null hypothesis in favor of the alternative.
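In code, that choice is just a number fixed up front; a minimal sketch of the decision rule:

```r
alpha <- 0.1  # significance level, chosen before computing the p-value

# Once the p-value has been calculated:
#   p_value <  alpha  ->  reject H0 in favor of HA
#   p_value >= alpha  ->  fail to reject H0
```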

7. Calculating p-values: one proportion vs. a value

In Chapter 1, to get the p-value, you transformed the z-score with the normal CDF. Since it was a right-tailed test, you set lower.tail to FALSE.
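As a reminder, that earlier calculation looked roughly like the sketch below; z_score here is a placeholder, not a value from this lesson.

```r
# Right-tailed test: probability of a z-score at least this large under the null
z_score <- 1.8  # placeholder value
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
```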

8. Calculating p-values: two means from different groups

Now that we are calculating means rather than proportions, the z-score is replaced with a t test statistic, the value calculated in the previous video. The calculation also needs the degrees of freedom, which is the number of observations in both groups minus two. In the previous slides, we approximated the standard error of the test statistic from sample information rather than by bootstrapping. A consequence of this is that, to calculate the p-value, we transform the test statistic using the t-distribution CDF instead of the normal distribution CDF. The approximation adds extra uncertainty, and that's why this is a t problem instead of a z problem: the t-distribution allows for more uncertainty when multiple estimates feed into a single statistic calculation. Notice the use of pt instead of pnorm, and that the df argument is set to the degrees of freedom. Since this p-value is less than the significance level of point-one, we reject the null hypothesis in favor of the alternative hypothesis that Stack Overflow data scientists who started coding as children earn more.
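Putting the pieces together, the calculation described above looks roughly like this; t_stat and the group sizes are placeholders rather than the actual values from the video.

```r
t_stat <- 2.4    # placeholder: the t statistic from the previous video
n_child <- 100   # placeholder: observations who started coding as a child
n_adult <- 150   # placeholder: observations who started coding as an adult

degrees_of_freedom <- n_child + n_adult - 2

# Right-tailed test, so take the upper tail of the t-distribution CDF
p_value <- pt(t_stat, df = degrees_of_freedom, lower.tail = FALSE)
p_value
p_value < 0.1    # TRUE means we reject the null hypothesis
```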

9. Let's practice!

While I reevaluate my childhood and wonder why I didn't start programming earlier, time for you to do some exercises.