
Multiple comparisons problem

The multiple comparisons problem arises when a researcher repeatedly tests different variables or samples against one another for significance. Purely by chance, some of these tests will come out statistically significant: at a 5% significance level, roughly one in twenty tests of unrelated data will appear significant.
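
To put a rough number on this: if each test has a 5% false-positive rate and the tests are independent (an assumption made here purely for illustration), the chance of at least one spurious "significant" result among m tests is 1 - (1 - 0.05)^m. A minimal sketch, where the helper name `familywise_error_rate` is hypothetical and not part of the exercise:

```python
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha. Independence is an assumption.
def familywise_error_rate(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

# Show how quickly the chance of a spurious finding grows with more tests
for m in (1, 5, 20, 100):
    print(m, round(familywise_error_rate(m), 3))
```

With 20 independent tests the chance of at least one false positive is already well above one half, which is why repeated testing needs care.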

In this exercise you'll work with salary data for employees of the City of Austin, TX. You will compare their salaries against randomly generated data and see how often that random data appears "significant" in explaining the salaries. Any such "significance" is spurious, since random numbers cannot explain anything about what employees are paid!

A DataFrame of police officers' salaries (police_salaries_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, Matplotlib as plt, and stats from SciPy.

This exercise is part of the course

Foundations of Inference in Python


Exercise instructions

  • Store the number of people in the dataset in n_rows (each row is a person), and initialize the number of significant results, n_significant, to zero.
  • Write a for loop that runs 1000 times and, on each iteration, generates n_rows random numbers.
  • Compute Pearson's r and the associated p-value between these randomly generated numbers and the police officer salaries.
  • If the p-value is significant at the 5% level, add one to n_significant using the += operator.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Compute number of rows and initialize n_significant
n_rows = ____
n_significant = ____

# For loop which generates n_rows random numbers 1000 times
for i in ____:
  random_nums = np.random.uniform(size=____)
  # Compute correlation between random_nums and police salaries
  r, p_value = stats.____(____, random_nums)
  # If the p-value is significant at 5%, increment n_significant
  if ____ < ____:
    ____ += 1
    
print(n_significant)
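
For readers without access to the exercise environment, the same idea can be tried on made-up data. The following is a minimal sketch, not the course's solution: the `salaries` array is a synthetic stand-in (hypothetical values, not the City of Austin data), and the seeded generator `rng` and the name `n_trials` are assumptions added for reproducibility and readability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # seeded so the run is repeatable

# Synthetic stand-in for a salary column (hypothetical, NOT the Austin data)
salaries = rng.normal(loc=70_000, scale=12_000, size=500)

n_rows = len(salaries)   # one row per person
n_significant = 0        # count of spurious "significant" correlations
n_trials = 1000

for _ in range(n_trials):
    # Random numbers that have no real relationship to the salaries
    random_nums = rng.uniform(size=n_rows)
    # Pearson correlation and its p-value
    r, p_value = stats.pearsonr(salaries, random_nums)
    # Count the result if it is significant at the 5% level
    if p_value < 0.05:
        n_significant += 1

print(n_significant, n_significant / n_trials)
```

Because the random numbers are unrelated to the salaries, the share of "significant" results should land near the 5% threshold itself; the exact count varies with the seed and the data.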