1. Evaluating distribution choices
Now we'll further evaluate the choice of probability distributions.
2. Choosing variable probability distributions
In the last chapter, we learned that the choice of variable probability distributions starts with an intuitive understanding of the data and available distributions, often aided by exploratory visualizations.
We then use Maximum Likelihood Estimation to compare different candidate distributions and pick the best among the candidates.
The last step, discussed in this lesson, is to further evaluate the goodness-of-fit of candidate distributions against the data. The Kolmogorov–Smirnov test, often abbreviated as ks-test, is a great way to do this.
The ks-test statistic quantifies the maximum distance between the empirical cumulative distribution of the data and the theoretical cumulative distribution of the candidate. To perform this calculation, we'll use the kstest function in SciPy.
Calculating MLE and running ks-tests serve similar but distinct purposes: MLE picks the best candidate among a set of candidate distributions, while the ks-test tells us whether a given probability distribution fits the data well.
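As a quick illustration of what kstest returns (a toy example of my own, not from the course), we can test a sample that we know is normal against two candidates:

```python
import numpy as np
from scipy import stats

# Toy data: 500 draws from a standard normal distribution
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=500)

# Against the true distribution: large p-value, so we cannot reject the fit
print(stats.kstest(sample, "norm"))

# Against a mismatched distribution: tiny p-value, so we reject the fit
print(stats.kstest(sample, "expon"))
```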
3. Evaluating choice of distribution: age
Let's use the age variable in the diabetes dataset as an example.
After creating an empty list to store our results,
we define a list of three candidate distribution choices: Laplace, normal and exponential.
For each distribution, we use the getattr function to obtain the corresponding distribution object from the scipy-dot-stats module based on its name; the distributions themselves are attributes of the module.
Then we fit the age data of the diabetes dataset using this distribution.
Next, we perform the kstest. Three arguments are passed to the kstest function: the first is the data, in this case, the age column of the dia DataFrame; the second is the name of the probability distribution, represented here by i; and the third is the parameters obtained by fitting in the previous step.
The result of the ks-test contains the test statistic and the associated p-value as its first and second items,
which we print for each distribution.
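Putting these steps together, here is a minimal sketch of the loop (loading the data from scikit-learn and the DataFrame name dia are my assumptions about the setup, not necessarily the course's exact code):

```python
from scipy import stats
from sklearn.datasets import load_diabetes

# Assumed setup: the diabetes data in a DataFrame named `dia`
dia = load_diabetes(as_frame=True).frame

results = []  # empty list to store the results

# Three candidate distribution choices
for i in ["laplace", "norm", "expon"]:
    # Distributions are attributes of scipy.stats; getattr looks one up by name
    dist = getattr(stats, i)
    # Fit the candidate distribution to the age data
    params = dist.fit(dia["age"])
    # ks-test arguments: the data, the distribution name, and the fitted parameters
    statistic, pvalue = stats.kstest(dia["age"], i, args=params)
    results.append((i, statistic, pvalue))
    print(f"{i}: statistic = {statistic:.4f}, p-value = {pvalue:.4g}")
```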
4. Evaluating choice of distribution: age
Let's take a closer look at the results. The ks-test statistic measures the distance between the distribution of the data and the candidate theoretical distribution. The p-value is the probability of observing a distance at least this large if the data really were generated from the candidate distribution, so a small p-value is evidence against that distribution. A p-value cutoff of 0-point-05 is often used to make the determination.
Examining the results, we see that for the Laplace and exponential distributions in the first and third rows, the p-values are very small, indicating it is highly unlikely that the age data in the diabetes dataset was generated by these distributions. For the normal distribution, the p-value is around 0-point-067, so we cannot rule out the possibility that the age data comes from a normal distribution. Our choice of a normal distribution for the age data, based on Maximum Likelihood Estimation in the last chapter, looks reasonable based on the ks-test as well.
5. Evaluating choice of distribution: tc blood serum
Similarly, we can conduct a ks-test for the tc blood serum data in the diabetes dataset.
Again, using the three candidate probability distributions of Laplace, normal, and exponential, we calculate the goodness-of-fit of these distributions against the tc data using kstest.
Looking at the corresponding p-values, we see that the p-value for the normal distribution, in the middle, is about 0-point-19, the largest of the three. With the p-value well above 0-point-05, the normal distribution is probably a good choice for simulating the tc blood serum data.
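Continuing from the sketch above, the loop is the same, just pointed at the tc column. In scikit-learn's copy of the data this column is named s1, so the rename below is an assumption about how the course's dia DataFrame was prepared:

```python
# Assumption: the course's `dia` names scikit-learn's s1 column "tc"
dia = dia.rename(columns={"s1": "tc"})

for i in ["laplace", "norm", "expon"]:
    # Fit each candidate to the tc data, then test the goodness-of-fit
    params = getattr(stats, i).fit(dia["tc"])
    statistic, pvalue = stats.kstest(dia["tc"], i, args=params)
    print(f"{i}: p-value = {pvalue:.4g}")
```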
6. Let's practice!
Alright, let's practice evaluating some input distributions!