1. Addressing complexities in experimental data
Next, we will look into addressing complexities in experimental data, focusing on identifying and mitigating issues like interactions, confounding variables, and heteroscedasticity.
2. Geological data
The mineral_rocks dataset encompasses 300 rock samples, detailing attributes like rock type, geographical location, mineral hardness, and rock porosity. Each entry in the dataset represents a unique sample, identified by its SampleID, and characterized by varying levels of MineralHardness and RockPorosity across different rock types and locations. Understanding the distribution and interactions within this data is critical for selecting the right statistical tests for our analysis.
3. Understanding data complexities
Our exploration begins by identifying potential complexities within our mineral_rocks dataset:
Interactions between rock type and mineral hardness might influence the observed rock porosity.
The variance in rock porosity, a key feature of our dataset, might not be consistent across all samples, indicating potential heteroscedasticity.
There could be confounding variables that affect both mineral hardness and rock porosity. This is often the hardest problem to solve, since it likely means gathering further data to capture the missing variable.
Understanding these issues helps us decide whether we can use parametric tests, which assume normality and homoscedasticity, or whether we should rely on non-parametric tests, which do not assume a specific distribution.
4. Addressing interactions
With the mineral_rocks dataset, we begin by visualizing the relationship between MineralHardness and RockPorosity, colored by RockType. This initial exploration helps identify potential complexities, such as interactions between variables.
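A minimal sketch of this plot follows. The real mineral_rocks values are not shown here, so the code builds a synthetic stand-in with the same column names; the rock type labels, slopes, and noise level are assumptions chosen only to make the groups visible.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for mineral_rocks (the real data is not reproduced here).
rng = np.random.default_rng(0)
rock_types = np.repeat(["igneous", "metamorphic", "sedimentary"], 100)
hardness = rng.uniform(1, 10, size=300)

# Assumed per-type slopes: a different slope per group is what an
# interaction between RockType and MineralHardness looks like.
slope_by_type = {"igneous": 0.5, "metamorphic": 1.5, "sedimentary": 3.0}
slopes = np.array([slope_by_type[t] for t in rock_types])
porosity = 2 + slopes * hardness + rng.normal(0, 1, size=300)

mineral_rocks = pd.DataFrame({
    "RockType": rock_types,
    "MineralHardness": hardness,
    "RockPorosity": porosity,
})

# Scatter plot of porosity against hardness, colored by rock type.
ax = sns.scatterplot(
    data=mineral_rocks,
    x="MineralHardness",
    y="RockPorosity",
    hue="RockType",
)
ax.set_title("RockPorosity vs. MineralHardness by RockType")
```

In a script, calling plt.show() afterwards displays the figure.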
From the plot, we seem to have an interaction between rock type and mineral hardness on rock porosity, since there are distinct groupings by RockType.
Addressing interactions helps us understand whether more robust non-parametric methods are necessary for accurate analysis.
5. Addressing heteroscedasticity
Heteroscedasticity refers to the changing variability of a variable across the range of another variable. We use Seaborn's residplot to check for heteroscedasticity in our data: it regresses RockPorosity on MineralHardness and plots the residuals against MineralHardness. We include the lowess smoothing option to show the trend in the residuals going from left to right.
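The residual plot could be sketched as below. The DataFrame is a synthetic stand-in, not the real mineral_rocks data: the generating formula is an assumption chosen so that the spread of RockPorosity grows with MineralHardness. Note that lowess=True relies on the statsmodels package being installed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for mineral_rocks: positive, right-skewed porosity
# whose spread increases with hardness (assumed for illustration).
rng = np.random.default_rng(42)
hardness = rng.uniform(1, 10, size=300)
porosity = np.exp(0.2 * hardness + rng.normal(0, 0.4, size=300))
mineral_rocks = pd.DataFrame({
    "MineralHardness": hardness,
    "RockPorosity": porosity,
})

# Residuals of a linear fit of RockPorosity on MineralHardness,
# with a lowess curve overlaid to show any trend.
ax = sns.residplot(
    data=mineral_rocks,
    x="MineralHardness",
    y="RockPorosity",
    lowess=True,
)
ax.set_title("Residuals of RockPorosity vs. MineralHardness")
```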
We see that, overall, the lowess line remains somewhat close to 0 and relatively flat, but its curvature leads us to be a little cautious, since it suggests the spread differs in some areas of our data.
6. Non-normal data
When the residual plot deviates from expectations, it can be useful to explore the distribution of the variables used.
Here, we investigate RockPorosity with a histogram using Seaborn's displot function. We see that the data is skewed and of a non-normal shape.
7. Data transformation with Box-Cox
To address issues like skewness and heteroscedasticity, we can apply data transformations. Here, we use the Box-Cox transformation from scipy.stats on RockPorosity to stabilize variance and make the data more closely resemble a normal distribution. We add the transformed data as a column to our DataFrame. The Box-Cox transformation requires strictly positive values, which we have for all RockPorosity values.
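A minimal sketch of the transformation step, using a synthetic stand-in for mineral_rocks (the generating formula is an assumption). When called with only the data, scipy.stats.boxcox returns both the transformed values and the lambda it estimated by maximum likelihood.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Hypothetical stand-in for mineral_rocks with strictly positive,
# right-skewed porosity values.
rng = np.random.default_rng(42)
hardness = rng.uniform(1, 10, size=300)
porosity = np.exp(0.2 * hardness + rng.normal(0, 0.4, size=300))
mineral_rocks = pd.DataFrame({
    "MineralHardness": hardness,
    "RockPorosity": porosity,
})

# Box-Cox is only defined for strictly positive inputs.
assert (mineral_rocks["RockPorosity"] > 0).all()

# Transform and keep the fitted lambda for reference.
transformed, fitted_lambda = boxcox(mineral_rocks["RockPorosity"].to_numpy())
mineral_rocks["TransformedRockPorosity"] = transformed

# Compare skewness before and after the transformation.
skew_before = mineral_rocks["RockPorosity"].skew()
skew_after = mineral_rocks["TransformedRockPorosity"].skew()
print(f"lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```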
Note that this transformed data isn't perfectly normal, but does have much more of that bell shape than it did originally.
8. Post-transformation analysis
To verify that we've better addressed the heteroscedasticity with the Box-Cox transformation, we can repeat our residplot with the TransformedRockPorosity. This visualization helps us understand whether the Box-Cox transformation has successfully stabilized the variance across the range of MineralHardness, an important assumption for many statistical tests.
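One way to sketch this comparison is to draw both residual plots side by side. The data is a synthetic stand-in for mineral_rocks, the two-panel layout is a choice made here for comparison rather than something the original walkthrough specifies, and lowess=True again assumes statsmodels is installed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import boxcox

# Hypothetical stand-in for mineral_rocks (assumed generating formula).
rng = np.random.default_rng(42)
hardness = rng.uniform(1, 10, size=300)
porosity = np.exp(0.2 * hardness + rng.normal(0, 0.4, size=300))
mineral_rocks = pd.DataFrame({
    "MineralHardness": hardness,
    "RockPorosity": porosity,
})

# Add the Box-Cox transformed column.
transformed, _ = boxcox(mineral_rocks["RockPorosity"].to_numpy())
mineral_rocks["TransformedRockPorosity"] = transformed

# Left panel: original residuals. Right panel: residuals after Box-Cox.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.residplot(data=mineral_rocks, x="MineralHardness",
              y="RockPorosity", lowess=True, ax=axes[0])
axes[0].set_title("Original")
sns.residplot(data=mineral_rocks, x="MineralHardness",
              y="TransformedRockPorosity", lowess=True, ax=axes[1])
axes[1].set_title("After Box-Cox")
fig.tight_layout()
```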
The lowess line is now much flatter, going from left to right across the plot. We can now feel more confident that this transformed data has better addressed heteroscedasticity than the non-transformed data.
9. Let's practice!
Time to put these techniques into practice!