Why do we need simulations?
In the last lesson, you sampled from a multivariate normal distribution using the mean and covariance matrix of dia. Now, you'll answer questions of interest using the simulated results!
You may ask: why do we perform simulations when we have historical data? Can't we just use the data itself to answer questions of interest?
This is a great question. Monte Carlo simulations are built on probability distributions fitted to the data, so you can draw as many samples as you need and inspect the whole distribution, rather than being limited to the data points available in the historical data.
For example, you can ask: what is the 0.1st quantile (that is, the 0.1th percentile, or quantile level 0.001) of the age variable for the diabetes patients in our simulation? The historical data dia can't answer this reliably on its own: with only 442 records, there is no one-in-a-thousand observation to read off. Instead, you can leverage the results of a Monte Carlo simulation, which you'll do now!
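To make the sample-size argument concrete, here is a minimal sketch. It assumes a made-up, normally distributed column standing in for a historical variable; the numbers are hypothetical and do not come from the real diabetes dataset. It compares how many observations are expected to fall below the 0.001 quantile with 442 records versus 10,000 simulated samples.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for one historical column with 442 records
# (values are made up, not taken from the real dataset).
historical = rng.normal(loc=50.0, scale=13.0, size=442)

# With 442 records, fewer than one observation is expected below the
# 0.001 quantile, so the estimate is an interpolation between the two
# smallest observations (NumPy's default linear method).
expected_below = 0.001 * len(historical)
hist_q = np.quantile(historical, 0.001)

# Fit a normal model to the historical column and draw many more samples.
simulated = st.norm.rvs(
    loc=historical.mean(),
    scale=historical.std(ddof=1),
    size=10_000,
    random_state=1,
)

# With 10,000 samples, about ten observations are expected below the
# 0.001 quantile, so the tail estimate rests on more than one point.
expected_below_sim = 0.001 * len(simulated)
sim_q = np.quantile(simulated, 0.001)

print(expected_below, hist_q)
print(expected_below_sim, sim_q)
```

The simulated estimate is only as good as the fitted model: if the normal assumption is poor in the tails, more samples will not fix that.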
The diabetes dataset has been loaded as a DataFrame, dia, and the following libraries have been imported for you: pandas as pd, numpy as np, and scipy.stats as st.
This exercise is part of the course
Monte Carlo Simulations in Python
Exercise instructions
- Calculate the 0.1st quantile (the value below which the bottom one-thousandth of the samples fall) of the tc variable in the simulated results.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
cov_dia = dia[["age", "bmi", "bp", "tc", "ldl", "hdl", "tch", "ltg", "glu"]].cov()
mean_dia = dia[["age", "bmi", "bp", "tc", "ldl", "hdl", "tch", "ltg", "glu"]].mean()
simulation_results = st.multivariate_normal.rvs(mean=mean_dia, size=10000, cov=cov_dia)
df_results = pd.DataFrame(simulation_results, columns=["age", "bmi", "bp", "tc", "ldl", "hdl", "tch", "ltg", "glu"])
# Calculate the 0.1st quantile of the tc variable
print(____)
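One possible way to fill in the blank is sketched below. It is not the official solution. Because the real dia DataFrame is not available here, the sketch builds a hypothetical three-column stand-in with made-up values; only the final quantile call is the point. It assumes the "0.1st quantile" corresponds to quantile level 0.001 in pandas.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical stand-in for dia: three columns, 442 rows, made-up values.
cols = ["age", "bmi", "tc"]
rng = np.random.default_rng(42)
dia = pd.DataFrame(
    rng.normal(loc=[48.0, 26.0, 189.0], scale=[13.0, 4.4, 35.0], size=(442, 3)),
    columns=cols,
)

# Same steps as the exercise scaffold, restricted to the stand-in columns.
cov_dia = dia[cols].cov()
mean_dia = dia[cols].mean()
simulation_results = st.multivariate_normal.rvs(
    mean=mean_dia, cov=cov_dia, size=10000, random_state=0
)
df_results = pd.DataFrame(simulation_results, columns=cols)

# Quantile level 0.001 of the simulated tc values.
tc_q = df_results["tc"].quantile(0.001)
print(tc_q)
```

An equivalent call is np.quantile(df_results["tc"], 0.001); both use linear interpolation by default.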