Visualizing simulation results
1. Visualizing simulation results
Let's now use visualizations to explore simulation results.
2. Answering questions using Monte Carlo simulations
Recall that the predicted y value in the diabetes dataset represents predicted disease progression. Suppose we're interested in how the predicted y values differ between people in the top (fourth) quartile of each predictor and people in the bottom (first) quartile.
3. Starting with the simulated results
Assume we have a DataFrame called df_summary containing the results of our simulation, which we've seen in prior lessons. It contains predictors simulated by sampling from a multivariate normal distribution, as well as predicted y values calculated deterministically. We'll use this as our starting point.
4. Answering our question
Let's loop through each predictor variable in df_summary, obtaining its 25th and 75th percentiles (the 0.25 and 0.75 quantiles). For each predictor, we calculate the difference between the mean predicted y of people in the fourth quartile and the mean predicted y of people in the first quartile. Finally, the results for each variable are saved in the dic_diffs dictionary, which we convert to a DataFrame for easy viewing. Outcomes for patients in the top quartile of BMI are much worse than for those in the bottom quartile, while patients in the top quartile of HDL values have better outcomes.
5. Simulating 1,000 times
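Before moving on, here is a minimal sketch of the quartile comparison described in the previous step. The df_summary built here is a synthetic stand-in with only three predictors, and the column name predicted_y, the coefficients, and the name df_diffs_single are assumptions made for illustration, not the values from the actual simulation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_summary: synthetic predictors plus a made-up
# linear "predicted y". The real df_summary comes from the earlier simulation.
rng = np.random.default_rng(0)
df_summary = pd.DataFrame(
    rng.normal(size=(1000, 3)), columns=["bmi", "bp", "hdl"]
)
df_summary["predicted_y"] = (
    150 + 25 * df_summary["bmi"] + 15 * df_summary["bp"] - 12 * df_summary["hdl"]
)

dic_diffs = {}
for col in df_summary.columns.drop("predicted_y"):
    # Quartile boundaries for this predictor
    q25 = df_summary[col].quantile(0.25)
    q75 = df_summary[col].quantile(0.75)
    # Mean predicted y in the top quartile minus mean in the bottom quartile
    mean_top = df_summary.loc[df_summary[col] > q75, "predicted_y"].mean()
    mean_bottom = df_summary.loc[df_summary[col] < q25, "predicted_y"].mean()
    dic_diffs[col] = [mean_top - mean_bottom]

# One row, one column per predictor
df_diffs_single = pd.DataFrame(dic_diffs)
print(df_diffs_single)
```

With these made-up coefficients, the difference is positive for bmi and bp and negative for hdl, which mirrors the pattern described for the real data.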
Assume we run the Monte Carlo simulation 1,000 times to get 1,000 different df_summary DataFrames. For each df_summary, we calculate the fourth-versus-first-quartile mean difference in predicted y for each predictor, as we did on the previous slide. Combining all these mean differences into a single DataFrame gives df_diffs, which we will use to visualize our results.
6. Pairplot
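The repeated simulation might be sketched as follows. Everything about the simulation itself is assumed here for illustration: the three predictors, the mean vector, the covariance matrix, the linear coefficients, and the helper names simulate_summary and quartile_diffs are all hypothetical, standing in for the model built in prior lessons.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical simulation inputs (not the course's fitted values)
predictors = ["bmi", "bp", "hdl"]
mean = np.zeros(3)
cov = np.array([[1.0, 0.4, -0.4],
                [0.4, 1.0, -0.2],
                [-0.4, -0.2, 1.0]])
coefs = np.array([25.0, 15.0, -12.0])

def simulate_summary(n_rows=500):
    """Sample predictors from a multivariate normal; compute y deterministically."""
    X = rng.multivariate_normal(mean, cov, size=n_rows)
    df = pd.DataFrame(X, columns=predictors)
    df["predicted_y"] = 150 + X @ coefs
    return df

def quartile_diffs(df):
    """Mean predicted y in the top quartile minus the bottom quartile, per predictor."""
    diffs = {}
    for col in predictors:
        q25 = df[col].quantile(0.25)
        q75 = df[col].quantile(0.75)
        top = df.loc[df[col] > q75, "predicted_y"].mean()
        bottom = df.loc[df[col] < q25, "predicted_y"].mean()
        diffs[col] = top - bottom
    return diffs

# One row of mean differences per simulation run
df_diffs = pd.DataFrame([quartile_diffs(simulate_summary()) for _ in range(1000)])
print(df_diffs.describe())
```

Each row of df_diffs holds the mean differences from one run, so the spread of each column shows how much the quartile effect varies across simulations.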
Let's create a pairplot from the new df_diffs.
7. Pairplot
Look at the sixth row or column, representing HDL. The scatterplot patterns indicate that the fourth-versus-first-quartile mean differences in predicted y for HDL are negatively correlated with those for the other variables, while the mean differences for the other variables are positively correlated with one another. Let's use a correlation heat map to make this message clear!
8. Clustermap
First, we use the .corr() method to calculate the correlation matrix of df_diffs. Then we can use Seaborn's clustermap function to plot the correlation matrix. The variables are clustered and ordered according to their correlation. The dark purple color represents the negative correlations associated with HDL, and the pale pinkish color represents positive correlations. Again, there is a strong negative correlation between the results for HDL and those for the other variables, and there are positive correlations among the results for the variables other than HDL.
9. Converting to long format
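Before converting formats, here is a sketch of the clustermap step from the previous slide. It assumes seaborn and SciPy are installed (clustermap relies on SciPy for the clustering), and it again uses a synthetic three-column df_diffs rather than the real simulation output.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df_diffs (assumed values, for illustration only)
rng = np.random.default_rng(2)
n_runs = 1000
shared = rng.normal(size=n_runs)
df_diffs = pd.DataFrame({
    "bmi": 60 + 5 * shared + rng.normal(size=n_runs),
    "bp": 40 + 4 * shared + rng.normal(size=n_runs),
    "hdl": -30 - 4 * shared + rng.normal(size=n_runs),
})

# Pairwise Pearson correlations between the columns
corr = df_diffs.corr()

# Heat map of the correlation matrix with rows and columns
# reordered by hierarchical clustering
cluster_grid = sns.clustermap(corr)
# In a script, call matplotlib.pyplot.show() to display the figure.
```

With the default color map, low (negative) correlations are drawn dark and high (positive) correlations pale, which matches the description of the HDL row and column.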
DataFrames can be laid out in two formats: wide and long. The df_diffs DataFrame is in wide format; it has nine columns, with each column corresponding to one variable, such as age. DataFrames in long format often have just two columns: one containing the variable name and the other the corresponding value. Using pandas' melt method, we can convert df_diffs into the long format. Let's save it as df_diffs_long. If we check the first few rows of df_diffs_long, we can see that there are only two columns: one containing the variable name, such as age, and the other, y_diff, containing the corresponding value. The long format is handy for creating some visualizations, such as box plots.
10. Boxplot
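The wide-to-long conversion from the previous slide can be sketched on a tiny made-up DataFrame. The numbers are invented, and the column name "variable" passed as var_name is an assumption (it is also the pandas default); y_diff is the value column name mentioned in the lesson.

```python
import pandas as pd

# Tiny wide-format stand-in for df_diffs: one column per variable,
# one row per simulation run (values are made up)
df_diffs = pd.DataFrame({
    "age": [5.1, 4.8],
    "bmi": [60.2, 61.5],
    "hdl": [-30.4, -29.7],
})

# Stack all columns into (variable name, value) pairs
df_diffs_long = df_diffs.melt(var_name="variable", value_name="y_diff")
print(df_diffs_long.head())
```

The long DataFrame has one row per original cell, so two rows and three columns in wide format become six rows and two columns in long format.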
Let's plot df_diffs_long using Seaborn's boxplot function. Except for HDL, the mean differences in predicted y between people in the fourth quartile and people in the first quartile of each predictor are greater than zero. This suggests that the larger the values of these predictors, the worse the patient outcome will be in terms of disease progression. On the other hand, the higher the HDL value, the better the patient outcome will be.
11. Let's practice!
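Before practicing, here is a minimal sketch of the boxplot step, assuming seaborn is installed. The data is a synthetic stand-in for df_diffs, the "variable" column name is an assumption, and the dashed zero line is an optional addition that makes the sign of each box easier to read.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df_diffs (assumed values, for illustration only)
rng = np.random.default_rng(3)
n_runs = 1000
df_diffs = pd.DataFrame({
    "bmi": rng.normal(60, 5, size=n_runs),
    "bp": rng.normal(40, 4, size=n_runs),
    "hdl": rng.normal(-30, 4, size=n_runs),
})

# Long format: one row per (variable, value) pair
df_diffs_long = df_diffs.melt(var_name="variable", value_name="y_diff")

# One box per variable, showing the distribution of mean differences
ax = sns.boxplot(data=df_diffs_long, x="variable", y="y_diff")
# Reference line at zero: boxes above it mean worse outcomes in the top quartile
ax.axhline(0, color="gray", linestyle="--")
# In a script, call matplotlib.pyplot.show() to display the figure.
```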
Let's practice!