Explore the data

1. Explore the data

In this chapter, we'll learn how to perform a principled Monte Carlo simulation. The first step, and the topic of this lesson, is to thoroughly explore the data to understand the dependent and independent variables as well as their relationships.

2. The diabetes dataset

Our dataset is a publicly available diabetes dataset. We'll use a Monte Carlo simulation to understand the impact of diabetes predictors on the response. The dataset has ten independent variables, or predictors: age, sex, body mass index, average blood pressure, and six blood serum measurements: tc, ldl, hdl, tch, ltg, and glu.

3. The diabetes dataset

The response, also known as the dependent variable y, is a quantitative measure of disease progression one year after baseline. There are 442 diabetes patients in the dataset.

4. The diabetes dataset

The diabetes dataset has been loaded into a DataFrame named dia. An examination of the DataFrame shows that all ten independent variables and the dependent variable y are numerical.
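A check like the one just described can be sketched as follows. The transcript does not show how dia is loaded, so this sketch builds a random stand-in DataFrame with the same shape; the column names bmi and bp are assumed abbreviations for body mass index and average blood pressure, and none of the values are real.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: random values under the
# dataset's column names ("bmi" and "bp" are assumed names), so the
# snippet runs on its own. These are NOT the actual diabetes values.
columns = ["age", "sex", "bmi", "bp", "tc", "ldl", "hdl", "tch", "ltg", "glu", "y"]
rng = np.random.default_rng(0)
dia = pd.DataFrame(rng.normal(size=(442, len(columns))), columns=columns)
dia["sex"] = rng.integers(0, 2, size=442)  # two-valued column, like sex

print(dia.shape)   # rows (patients) and columns (ten predictors plus y)
print(dia.dtypes)  # one dtype per column

# True only if every column holds numbers
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in dia.dtypes)
print(all_numeric)
```

With the real DataFrame, only the last few lines would be needed; dia.info() and dia.describe() give a similar overview.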

5. Why do we explore data before simulation?

Before we explore our diabetes data, let's discuss the goals of our exploration. First, we want to visually inspect the distribution of each variable, which provides intuition for which probability distribution to use in the simulation. Second, we need to check and measure the correlation between predictor variables, which gives us a rationale for modeling the covariance structure among these predictors. Third, we want to check and measure the correlation between each predictor variable and the response, which gives us an initial understanding of the relationship between them.

6. Pairplot of the dataset

Let's use a pairplot to inspect the variables. Pairplots are great tools for inspecting variable distributions and their pairwise correlations. The subplots on the diagonal show a histogram of each variable.

7. Pairplot of the dataset

For example, the second histogram on the diagonal, counting from the top, is the histogram of the second variable in the DataFrame, which is sex. Unlike the other variables in the dataset, which tend to have a range of possible values, sex only has two values, zero and one, representing male and female. The off-diagonal subplots are pairwise scatterplots, which indicate the relationship between each pair of variables.

8. Pairplot of the dataset

Looking at the pairwise scatterplot of the fifth and sixth variables, the tc and ldl blood serum measurements, we can see that the points cluster around the diagonal of the subplot, indicating a strong positive correlation between them.

9. Correlations between different variables

We can also use the corr method in pandas to compute the pairwise correlations between all the variables and gain numerical insight. In the results shown, the values on the diagonal are the correlation of each variable with itself, so they are all one. Values in the other cells are the correlation coefficients between the corresponding variables. For example, the correlation coefficient of tc and ldl is around 0.897, the highest among the pairwise correlation coefficients. This agrees with the strong positive correlation we detected visually on the previous slide. Let's focus on the last row, or equivalently the last column, which holds the pairwise correlation between the dependent variable y and each independent variable. This is a very crude way of measuring the predictive power of each independent variable for y. Sex has the lowest correlation coefficient with y, at around 0.043. For this reason, we'll use all variables except the sex variable for our simulations and related analysis.
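The correlation workflow above can be sketched as follows. The data here is synthetic (not the diabetes values) and the column names x1, x2, and noise are hypothetical: x2 is built to track x1, and y depends on x1 but not on noise, so the weakest predictor is known in advance.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real diabetes values).
rng = np.random.default_rng(42)
n = 442
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)  # strongly correlated with x1
noise = rng.normal(size=n)                # unrelated to everything else
y = 2.0 * x1 + rng.normal(size=n)         # response driven by x1
df = pd.DataFrame({"x1": x1, "x2": x2, "noise": noise, "y": y})

corr = df.corr()  # Pearson correlation is the default method
print(corr.round(3))

# The y column holds each predictor's correlation with the response.
# Rank predictors by absolute correlation, weakest first.
with_y = corr["y"].drop("y")
ranked = with_y.abs().sort_values()
print(ranked)

weakest = ranked.index[0]
print(weakest)
```

Ranking by absolute value is a choice made in this sketch, since a strong negative correlation is as informative as a strong positive one. With the lesson's data, the equivalent calls would be dia.corr() and the y column of its result.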

10. Let's practice!

Alright, let's continue exploring the diabetes dataset in the exercises!
