
Choosing probability distributions

1. Choosing probability distributions

Picking the appropriate probability distributions is vital to a successful simulation.

2. Maximum Likelihood Estimation (MLE)

To do this, we'll use maximum likelihood estimation, or MLE, to measure how well a probability distribution fits the observed data. We calculate a likelihood function for different probability distribution parameters given the observed data. The distribution whose parameters yield the highest likelihood given the data is considered the best fit. We will use SciPy's .nnlf method to score each fit. The value that .nnlf returns is actually the negative log-likelihood, which this lesson refers to as the MLE value. Therefore, the lower the nnlf value, the better the distribution fit.
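As a rough illustration of what .nnlf reports, here is a minimal sketch on synthetic data. The sample, its mean and spread, and the variable names are assumptions made for this example and are not part of the lesson's dataset.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=1000)

# Fit a normal distribution; for norm, fit returns (loc, scale).
pars = st.norm.fit(data)

# Negative log-likelihood of the data under the fitted parameters.
fitted_nnlf = st.norm.nnlf(pars, data)

# The same quantity computed by hand from the log-density.
manual_nnlf = -np.sum(st.norm.logpdf(data, *pars))

# Deliberately poor parameters give a higher (worse) value.
off_nnlf = st.norm.nnlf((30.0, 10.0), data)

print(fitted_nnlf, manual_nnlf, off_nnlf)
```

The fitted parameters should score lower than the deliberately shifted ones, which is the sense in which a lower nnlf value means a better fit.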

3. Picking a distribution for the age variable

Let's pick the right probability distributions to model the variables in our diabetes dataset. Looking at a histogram of the age variable, we see that a normal distribution might be a good choice. Let's use maximum likelihood estimation to evaluate our guess against other candidate probability distributions!

4. Candidate distributions

First, let's create a list of three candidate distributions to evaluate against the age column of the diabetes dataset: here we'll use the Laplace, normal, and exponential distributions from SciPy. We have already covered the normal and exponential distributions. Here's an example probability density function of a Laplace distribution, which has a sharper peak and a quicker drop in density from peak to shoulder than the normal distribution.
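A small sketch of the candidate list, together with a quick comparison of the Laplace and normal densities. Both distributions use SciPy's default location 0 and scale 1 here, which is an assumption chosen only to make the shapes easy to compare.

```python
from scipy import stats as st

# The three candidates named in the lesson.
distributions = [st.laplace, st.norm, st.expon]

# Compare the Laplace and normal densities at the peak and one unit away,
# both with default loc=0 and scale=1.
laplace_peak = st.laplace.pdf(0.0)
norm_peak = st.norm.pdf(0.0)
laplace_shoulder = st.laplace.pdf(1.0)
norm_shoulder = st.norm.pdf(1.0)

print("peak:    ", laplace_peak, norm_peak)
print("shoulder:", laplace_shoulder, norm_shoulder)
```

With these default parameters the Laplace density is higher at the peak but lower one unit away, which matches the quicker drop described above.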

5. Choosing between candidate distributions

First, let's create a list called mles, which will hold the negative log-likelihood score for each distribution. Next, for each of the three distributions in the distributions list, we'll fit the distribution to the age data by calling distribution.fit on the age column of the dia DataFrame, saving the fitted parameters in a variable called pars. Then we'll call distribution.nnlf with pars and the age data, which returns the negative log-likelihood of the fit, and append that value to the mles list. At the end of the for-loop, we print the list of mle values, corresponding to the Laplace, normal, and exponential distributions, respectively. The normal distribution fit yields the lowest mle value, so it is the best of the three distributions for describing the age variable.
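The loop described here might look roughly like the sketch below. The age array is synthetic, normally distributed data standing in for the age column of dia (an assumption, since the real data is not shown in this transcript), so the printed numbers will not match the lesson's.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in for the age column of dia (assumption).
rng = np.random.default_rng(1)
age = rng.normal(loc=48.0, scale=13.0, size=2000)

distributions = [st.laplace, st.norm, st.expon]
mles = []

for distribution in distributions:
    # Fit the distribution; pars holds the fitted (loc, scale).
    pars = distribution.fit(age)
    # Negative log-likelihood of the data under the fitted parameters.
    mle = distribution.nnlf(pars, age)
    mles.append(mle)

# Scores in the order Laplace, normal, exponential; lower is better.
print(mles)
```

Because the stand-in data is drawn from a normal distribution, the normal fit is expected to score lowest here too, but that reflects how the sample was generated rather than anything about the diabetes data.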

6. Choosing between candidate distributions

Let's apply this MLE evaluation method to all variables of interest, which are all the variables except sex, since sex showed little correlation with the response variable in the previous lesson. We'll evaluate the same three distributions, Laplace, normal, and exponential, and create the mles list to store their respective scores. We fit each of the three distributions to each variable by calling distribution.fit on the corresponding column of the dia DataFrame. We again use distribution.nnlf to obtain the MLE value of the fit and record it in the mles list. At the end of the inner for-loop for each variable, we zip together the distributions and mles to create a list of tuples, each containing a distribution and its associated MLE value. Then we sort the zipped results by MLE value in ascending order, using a lambda function as the sort key. The lambda function defines the key as d[1], the second item of each tuple, which is the MLE value. From the sorted results, we find best_fit by taking the first item, at index zero, which has the lowest MLE value. Finally, we extract the distribution name using best_fit[0].name and the corresponding MLE value at best_fit[1].
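The per-variable evaluation could be sketched as follows. A plain dictionary of synthetic arrays stands in for the dia DataFrame, and the column names and their parameters are hypothetical; indexing with dia[col] would read the same way on a real DataFrame.

```python
import numpy as np
from scipy import stats as st

# Hypothetical stand-in for dia: synthetic columns (assumption).
rng = np.random.default_rng(2)
dia = {
    "age": rng.normal(48.0, 13.0, size=2000),
    "bmi": rng.normal(26.0, 4.0, size=2000),
    "bp": rng.normal(95.0, 14.0, size=2000),
}

distributions = [st.laplace, st.norm, st.expon]
best_names = {}

for col in dia:
    mles = []
    for distribution in distributions:
        pars = distribution.fit(dia[col])
        mles.append(distribution.nnlf(pars, dia[col]))
    # Pair each distribution with its score, then sort ascending by score.
    results = sorted(zip(distributions, mles), key=lambda d: d[1])
    # The first entry has the lowest negative log-likelihood.
    best_fit = results[0]
    best_names[col] = best_fit[0].name
    print(col, best_fit[0].name, best_fit[1])
```

Sorting on d[1] rather than on the tuples themselves avoids comparing distribution objects directly, which is why the key function is needed.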

7. Results of the evaluation

Let's look at the results! For every variable, the normal distribution yields the lowest mle value and is the best fit. This means that we'll use the multivariate normal distribution in the next lesson to simulate the diabetes data!

8. Let's practice!

Now, let's practice!
