Multiple comparisons problem
The multiple comparisons problem arises when a researcher repeatedly tests different variables or samples against one another for statistical significance. By random chance alone, some of those tests will occasionally come out statistically significant even when no real relationship exists.
In this exercise you'll work with salary data for employees of the City of Austin, TX. You will compare their salaries against randomly generated data and see how often this random data is "significant" in explaining the salaries of employees. Clearly any such "significance" would be spurious, as random numbers aren't very helpful in explaining anything!
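To put a rough number on "occasional", here is a minimal sketch of the arithmetic. It assumes the tests are independent and that every null hypothesis is true; the names `alpha`, `n_tests`, and `family_wise_error_rate` are illustrative and not part of the exercise.

```python
# Illustrative values: a 5% significance level and 1000 tests.
alpha = 0.05
n_tests = 1000

# Expected number of spurious "significant" results when every null is true
expected_false_positives = alpha * n_tests


def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


print(expected_false_positives)
for m in (1, 5, 20, 100):
    print(m, round(family_wise_error_rate(alpha, m), 3))
```

Under these assumptions, about 5% of the tests are expected to look significant purely by chance, and the probability of seeing at least one such result grows quickly with the number of tests.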
A DataFrame of police officers' salaries (police_salaries_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, Matplotlib as plt, and stats from SciPy.
This exercise is part of the course Foundations of Inference in Python.
Exercise instructions
- Store the number of people in the dataset in n_rows (each row is a person), and initialize the number of significant results, n_significant, to zero.
- Write a for loop which runs 1000 times and generates n_rows random numbers.
- Compute Pearson's R and the associated p-value between these randomly generated numbers and the police officer salaries.
- If the p-value is significant at 5%, add one to n_significant using the += operator.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Compute number of rows and initialize n_significant
n_rows = ____
n_significant = ____

# For loop which generates n_rows random numbers 1000 times
for i in ____:
    random_nums = np.random.uniform(size=____)
    # Compute correlation between random_nums and police salaries
    r, p_value = stats.____(____, random_nums)
    # If the p-value is significant at 5%, increment n_significant
    if ____ < ____:
        n_significant += 1

print(n_significant)
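For readers without access to the preloaded data, the following is one possible self-contained sketch of the same simulation. It is not the course's reference solution: the salaries are a synthetic stand-in generated here (the real police_salaries_df and its column names are not shown on this page), and the salary distribution, sample size, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# ASSUMPTION: synthetic salaries standing in for the real police salary column
salaries = rng.normal(loc=70_000, scale=10_000, size=500)

# Number of people (one per row) and running count of significant results
n_rows = len(salaries)
n_significant = 0
n_trials = 1000

for _ in range(n_trials):
    # Random numbers that have no real relationship to the salaries
    random_nums = rng.uniform(size=n_rows)
    # Pearson's R and its p-value between salaries and the random numbers
    r, p_value = stats.pearsonr(salaries, random_nums)
    # Count the result if it is significant at the 5% level
    if p_value < 0.05:
        n_significant += 1

print(n_significant)
```

Because the random numbers are unrelated to the salaries by construction, every "significant" result counted here is a false positive, and the count should land near 5% of the trials.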