Exercise

Multiple comparisons problem

The multiple comparisons problem arises when a researcher repeatedly tests different variables or samples against one another for significance. By random chance alone, we expect an occasional statistically significant result: at a 5% significance level, roughly 1 in 20 tests of unrelated data will appear significant.
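
To make the arithmetic concrete, here is a minimal sketch of how quickly false positives accumulate. It assumes every test is independent and that there is no real effect in any of them; the names alpha, expected_false_positives, and prob_at_least_one are illustrative and not part of the exercise.

```python
# Significance level used for each individual test.
alpha = 0.05

expected_false_positives = {}
prob_at_least_one = {}

for n_tests in (1, 10, 100, 1000):
    # With no real effect, each test has probability alpha of looking significant.
    expected_false_positives[n_tests] = alpha * n_tests
    # Assuming independent tests, the chance that at least one looks significant.
    prob_at_least_one[n_tests] = 1 - (1 - alpha) ** n_tests

for n_tests in expected_false_positives:
    print(n_tests, expected_false_positives[n_tests], round(prob_at_least_one[n_tests], 4))
```

Under these assumptions, running 1000 tests at the 5% level yields about 50 spurious "significant" results on average.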

In this exercise you'll work with salary data for employees of the City of Austin, TX. You will compare the salaries against randomly generated data and see how often the random data is "significant" in explaining them. Clearly any such "significance" would be spurious, as random numbers aren't very helpful in explaining anything!

A DataFrame of police officers' salaries (police_salaries_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, matplotlib.pyplot as plt, and stats from SciPy.

Instructions

  • Store the number of people in the dataset in n_rows (each row is a person), and initialize the number of significant results, n_significant, to zero.
  • Write a for loop that runs 1000 times and generates n_rows random numbers on each iteration.
  • Compute Pearson's R and the associated p-value between these randomly generated numbers and the police officer salaries.
  • If the p-value is significant at the 5% level, add one to n_significant using the += operator.
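
The steps above can be sketched as follows. This is one possible approach, not the reference solution. Because the real police_salaries_df is only available inside the exercise environment, the sketch builds a synthetic stand-in, and the column name "salary" is an assumption that may differ from the actual dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for police_salaries_df; the "salary" column name
# and the synthetic values are assumptions for illustration only.
police_salaries_df = pd.DataFrame(
    {"salary": rng.normal(loc=70_000, scale=10_000, size=500)}
)

# Each row is a person.
n_rows = len(police_salaries_df)
n_significant = 0

for _ in range(1000):
    # Random numbers that have nothing to do with the salaries.
    random_numbers = rng.random(n_rows)

    # Pearson's R and its p-value between the random data and the salaries.
    r, p_value = stats.pearsonr(random_numbers, police_salaries_df["salary"])

    # Count the result if it is significant at the 5% level.
    if p_value < 0.05:
        n_significant += 1

# Share of the 1000 comparisons that were spuriously "significant".
print(n_significant, n_significant / 1000)
```

Since the random numbers are unrelated to the salaries, the share printed at the end should land near 0.05, which is exactly the rate of spurious findings the significance level allows.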