How are the Parkfield interearthquake times distributed?

1. How are the Parkfield interearthquake times distributed?

Knowing how the time between major earthquakes is distributed makes a big difference for assessing when the next earthquake will strike. It turns out that the Parkfield sequence has been central in the science of earthquake prediction.

2. The Parkfield Prediction

In the mid-1980s, seismologists predicted that the next Parkfield quake would occur in 1988, and almost certainly no later than 1993. They based their prediction on a linear regression, which essentially assumes a Gaussian model. But the earthquake did not come in 1988, nor in 1993; it came in late 2004. In light of this, you will work out in the exercises whether we can dismiss the Exponential model, presumably in favor of the Gaussian model. For now, as an illustration, we will look at the Nankai Trough earthquakes in the context of the Gaussian model.

3. Hypothesis test on the Nankai megathrust earthquakes

We will test the hypothesis that the times between Nankai megathrust earthquakes are Normally distributed, parametrized with the mean and standard deviation calculated from the observed earthquakes. We are left to specify the test statistic and what it means to be "at least as extreme as" the observed value.

4. The Kolmogorov-Smirnov statistic

What is a reasonable test statistic to measure how close an ECDF is to a theoretical CDF, in this case a Normal CDF?

5. The Kolmogorov-Smirnov statistic

Above, I plot the distance of the ECDF from the theoretical Normal CDF. We might take as our test statistic the maximum of these distances. This seems like a reasonable measure of the distance between the empirical CDF and that of the distribution we are testing against.

6. The Kolmogorov-Smirnov statistic

The maximal distance occurs around 150 years and has a value of about 0.2. This maximal distance has a name: the Kolmogorov-Smirnov statistic, or K-S statistic for short. But how do we compute it without having to make graphs like this?

7. The Kolmogorov-Smirnov statistic

It helps to look at where the local maximal distances occur. In every case, the local maximum is at a corner of the formal ECDF.

8. The Kolmogorov-Smirnov statistic

Note that the corner can be either a concave corner at the top of a step or a convex corner at the base of a step. You will use these ideas to write a function to compute the Kolmogorov-Smirnov statistic in the exercises.
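Since the local maxima sit at the corners of the ECDF, one way to compute the statistic is to evaluate the target CDF at each sorted data point and compare it with the step heights: i/n at the top of each step and (i-1)/n at its base. Here is a minimal sketch of that idea, approximating the theoretical CDF with the ECDF of a large sample as described in this video; `ks_stat()` is a hypothetical helper name, and you will write your own version in the exercises.

```python
import numpy as np

def ks_stat(data, theor_data):
    """K-S distance between the ECDF of `data` and the theoretical CDF,
    where the theoretical CDF is approximated by the ECDF of a large
    sample (`theor_data`) drawn from the theoretical distribution."""
    x = np.sort(data)
    n = len(x)
    # Theoretical CDF at each data point: fraction of theoretical
    # samples less than or equal to x_i
    cdf = np.searchsorted(np.sort(theor_data), x, side='right') / len(theor_data)
    # Distances at the concave corners (top of each step): i/n - F(x_i)
    d_top = np.arange(1, n + 1) / n - cdf
    # Distances at the convex corners (base of each step): F(x_i) - (i-1)/n
    d_bottom = cdf - np.arange(n) / n
    # The maximum over all corners is the K-S statistic
    return np.max(np.concatenate((d_top, d_bottom)))
```

Because the supremum distance between a step function and a continuous CDF must occur at one of the steps' corners, checking only those 2n candidate distances suffices.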

9. Kolmogorov-Smirnov test

So, now that we have our test statistic, which is always positive, it is clear that "at least as extreme as" means that the simulated K-S statistic is greater than or equal to the observed K-S statistic. The hypothesis test we just defined is called the Kolmogorov-Smirnov test. We are now left to figure out how to simulate acquiring the data under the null hypothesis.

10. Simulating the null hypothesis

Taking a hacker stats approach, we first generate the theoretical CDF by drawing many, like ten thousand, samples and storing them. Now, say we have *n* data points. For the Nankai dataset, *n* = 8. Then, to generate each Kolmogorov-Smirnov replicate, we draw *n* samples from the theoretical distribution. We then compute the K-S statistic using these *n* samples and the ten thousand samples we drew out of the theoretical distribution. Here is the technique in code. We can use functions in NumPy's `random` module to make the samples. The key part of the test, then, is computing the K-S statistic. You will write the `ks_stat()` function to do this in the exercises. Incidentally, the p-value for the hypothesis that the Nankai Trough earthquakes follow the Gaussian model is close to 0.9, so the data are commensurate with that model.

11. Simulating the null hypothesis
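The simulation just described can be sketched as follows. This is a hedged, self-contained sketch, not the course's exact code: the `ks_stat()` helper is one possible implementation of the function you will write in the exercises, and `time_gap` is a randomly generated stand-in for the real Nankai interearthquake times.

```python
import numpy as np

def ks_stat(data, theor_data):
    """K-S distance between the ECDF of `data` and the theoretical CDF,
    approximated by the ECDF of a large sample (`theor_data`)."""
    x = np.sort(data)
    n = len(x)
    cdf = np.searchsorted(np.sort(theor_data), x, side='right') / len(theor_data)
    d_top = np.arange(1, n + 1) / n - cdf      # concave corners
    d_bottom = cdf - np.arange(n) / n          # convex corners
    return np.max(np.concatenate((d_top, d_bottom)))

rng = np.random.default_rng(42)

# Stand-in for the observed interearthquake times (years); the real
# Nankai time gaps would go here.
time_gap = rng.normal(180.0, 60.0, size=8)

# Parametrize the Normal model from the observed data
mean_gap = np.mean(time_gap)
std_gap = np.std(time_gap)

# Ten thousand samples standing in for the theoretical Normal CDF
theor_samples = rng.normal(mean_gap, std_gap, size=10_000)

# Observed K-S statistic
d_obs = ks_stat(time_gap, theor_samples)

# K-S replicates under the null hypothesis: draw n samples from the
# theoretical distribution and compute the K-S statistic each time
n = len(time_gap)
reps = np.array(
    [ks_stat(rng.normal(mean_gap, std_gap, size=n), theor_samples)
     for _ in range(1000)]
)

# p-value: fraction of replicates at least as extreme as observed
p_val = np.sum(reps >= d_obs) / len(reps)
print('p-value:', p_val)
```

Because `time_gap` here is itself drawn from a Normal distribution, the p-value will typically be large; with the real Nankai data and this Gaussian model, the video reports a p-value close to 0.9.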

12. Let's practice!

Now it's time for you to compute p-values for the Parkfield sequence to see how it jibes with the Exponential model.