1. Summary statistics
Let's use our simulation results to answer questions of interest with the help of summary statistics!
2. Age outcomes
Perhaps we are interested in the difference in diabetes outcomes, the predicted y, for people in the first and fourth quartiles of age.
3. Age outcomes
Using NumPy's quantile function, we can find the 0.25 and 0.75 quantiles (the 25th and 75th percentiles) of age in the simulation results DataFrame, df_summary.
The quantiles are at about 39.8 and 57.5 years of age, respectively.
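As a minimal sketch of this step, the snippet below calls np.quantile on an age column. The df_summary built here is a hypothetical stand-in filled with synthetic values, not the course's simulation results, so the printed cutoffs will not match the numbers quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_summary: synthetic values, not the course data.
rng = np.random.default_rng(0)
df_summary = pd.DataFrame({
    "age": rng.normal(48.5, 13.0, size=1000),
    "predicted_y": rng.normal(150.0, 50.0, size=1000),
})

# np.quantile expects probabilities between 0 and 1,
# so 0.25 and 0.75 give the first and third quartile cutoffs.
age_q25, age_q75 = np.quantile(df_summary["age"], [0.25, 0.75])
print(age_q25, age_q75)
```

Note that np.quantile takes 0.25 and 0.75, whereas np.percentile would take 25 and 75 for the same cutoffs.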
4. Age outcomes
Let's assign these quantiles to age_q25 and age_q75.
Then we can filter the results in df_summary to find the patients in the first quartile of the age distribution (age below age_q25) and those in the fourth quartile (age above age_q75). For each group, we use the predicted_y column to extract the corresponding diabetes outcome values and np.mean to calculate the mean outcome.
Based on the simulation, the difference in the diabetes outcomes for people in the first and fourth quartiles of age is about 34. Remember that this is a measure of disease progression, with bigger values meaning more severe disease progression. Our results show that older patients have worse disease progression than younger patients.
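The filtering and averaging could look roughly like the sketch below. Everything in it is assumed for illustration: df_summary is synthetic, the relationship between age and predicted_y is a toy one, and the names q25_outcome, q75_outcome, and y_diff are hypothetical, so the resulting difference is not the course's value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_summary with an assumed toy relationship:
# the outcome rises with age, plus some noise.
rng = np.random.default_rng(1)
age = rng.normal(48.5, 13.0, size=1000)
predicted_y = 100.0 + 1.0 * age + rng.normal(0.0, 5.0, size=1000)
df_summary = pd.DataFrame({"age": age, "predicted_y": predicted_y})

# Quartile cutoffs for age.
age_q25, age_q75 = np.quantile(df_summary["age"], [0.25, 0.75])

# Mean outcome for patients in the first quartile (youngest 25 percent) ...
q25_outcome = np.mean(df_summary.loc[df_summary["age"] < age_q25, "predicted_y"])
# ... and for patients in the fourth quartile (oldest 25 percent).
q75_outcome = np.mean(df_summary.loc[df_summary["age"] > age_q75, "predicted_y"])

# Difference in mean outcomes between the two groups.
y_diff = q75_outcome - q25_outcome
print(y_diff)
```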
5. Outcome differences based on age and BMI
Our second question is a little more complicated: what is the difference in outcomes between people in the first quartile of both BMI and age and people in the fourth quartile of both?
And what are the 95 percent confidence interval and standard deviation for this difference?
In this case, we are not only interested in a point estimate based on one simulation result. Instead, we want to find the distribution for the answer and the uncertainty associated with it.
6. Outcome differences based on age and BMI
We'll use a for-loop to conduct the simulation.
Within each iteration of the for-loop, we draw 1,000 samples from a multivariate normal distribution to obtain a set of simulation results, which we turn into a DataFrame.
Then we perform a deterministic calculation using regr_model to obtain predicted_y values, which we also save in a DataFrame called df_y. We combine the simulated X values with the predicted y values into a summary DataFrame called df_sum.
Then we define the quantiles of BMI and age as age_q25, age_q75, bmi_q25, and bmi_q75.
These values are used to filter for patients in the first quartile of both age and BMI and for patients in the fourth quartile of both. We save the mean outcomes of the two groups as q25_outcomes and q75_outcomes.
Finally, we calculate the difference in mean outcomes for these two groups and save the results in y_diff. This is a point estimate for a one-time simulation of 1,000 multivariate normal samples.
With a for-loop of length 1,000, we will get 1,000 point estimates of y_diff, each of which will be recorded in the list y_diffs.
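Put together, the loop described above might be sketched as follows. This is only an illustration under stated assumptions: the mean vector, the covariance matrix, and the predict function (a fixed linear formula standing in for regr_model) are all hypothetical, and only two predictors are simulated, so the differences it produces are not the course's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical inputs: assumed means and covariance for age and BMI.
mean = np.array([48.5, 26.4])
cov = np.array([[171.0, 10.7],
                [10.7, 19.5]])

def predict(X):
    """Hypothetical stand-in for regr_model: a fixed linear formula."""
    return 20.0 + 0.5 * X["age"] + 4.0 * X["bmi"]

y_diffs = []
for _ in range(1000):
    # Draw 1,000 samples from the multivariate normal distribution.
    samples = rng.multivariate_normal(mean, cov, size=1000)
    df_x = pd.DataFrame(samples, columns=["age", "bmi"])

    # Deterministic calculation of the predicted outcomes.
    df_y = pd.DataFrame({"predicted_y": predict(df_x)})
    df_sum = pd.concat([df_x, df_y], axis=1)

    # Quartile cutoffs for age and BMI in this simulated sample.
    age_q25, age_q75 = np.quantile(df_sum["age"], [0.25, 0.75])
    bmi_q25, bmi_q75 = np.quantile(df_sum["bmi"], [0.25, 0.75])

    # Patients in the first quartile of both variables, and in the fourth of both.
    low = (df_sum["age"] < age_q25) & (df_sum["bmi"] < bmi_q25)
    high = (df_sum["age"] > age_q75) & (df_sum["bmi"] > bmi_q75)
    q25_outcomes = np.mean(df_sum.loc[low, "predicted_y"])
    q75_outcomes = np.mean(df_sum.loc[high, "predicted_y"])

    # One point estimate of the difference per iteration.
    y_diff = q75_outcomes - q25_outcomes
    y_diffs.append(y_diff)
```

Recomputing the quartile cutoffs inside the loop means each point estimate is based on that iteration's own simulated sample.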
7. Outcome differences based on age and BMI
Now, let's use summary statistics from the y_diffs list to answer our question.
Using np.mean, we see that the average of the mean differences is about 132.5.
We use np.quantile to calculate the 0.025 and 0.975 quantiles of the mean difference to obtain the 95 percent confidence interval, which is about 120.7 to 144.1.
We then use np.std to calculate the standard deviation, which is about 6.9.
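These three summary statistics can be computed as in the short sketch below. The y_diffs list here is a placeholder filled with arbitrary synthetic values, and the names mean_diff, ci_low, ci_high, and sd_diff are hypothetical, so the output will not reproduce the figures above.

```python
import numpy as np

# Placeholder for the list of simulated differences: arbitrary synthetic values.
rng = np.random.default_rng(7)
y_diffs = list(rng.normal(100.0, 10.0, size=1000))

# Average of the simulated differences.
mean_diff = np.mean(y_diffs)

# The 0.025 and 0.975 quantiles bound the central 95 percent of the estimates.
ci_low, ci_high = np.quantile(y_diffs, [0.025, 0.975])

# Spread of the simulated differences (np.std defaults to ddof=0).
sd_diff = np.std(y_diffs)

print(mean_diff, (ci_low, ci_high), sd_diff)
```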
Last but not least, we inspect the results using a histogram.
Our results suggest that the difference in disease progression between older patients with higher BMI and younger patients with lower BMI is even greater than the difference between older patients and younger patients. From our data exploration, we know that both age and BMI are positively associated with disease progression, so this outcome makes sense!
8. Let's practice!
Let's practice those summary statistics!