1. Assumptions and normal distributions
Welcome to chapter 4. Here, we'll look at the distributions of our samples and how they impact which tests we can use.
2. Summary stats
One way to examine a distribution is through summary stats. Each of these measures something of importance about a sample distribution. Let's consider a few of these for a simple continuous variable.
The mean is the sum of the values in the sample divided by the number of values in the sample.
The median is the midpoint of the sample's values when they are sorted: half of the values lie above it and half lie below it.
The mode is the value that occurs most often in the sample.
The standard deviation is a measure of the variability around the mean of a distribution.
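To make these four definitions concrete, here is a minimal sketch using Python's built-in statistics module. The five values are made up for illustration and do not come from the course data.

```python
import statistics

# Hypothetical sample values, chosen only so the arithmetic is easy to follow.
sample = [2, 3, 3, 5, 7]

mean_value = statistics.mean(sample)      # sum of values / number of values
median_value = statistics.median(sample)  # middle value of the sorted sample
mode_value = statistics.mode(sample)      # most frequent value
std_value = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)

print(mean_value, median_value, mode_value, std_value)
```

Note that statistics.stdev uses the sample formula with n minus 1 in the denominator; statistics.pstdev would give the population version instead.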
3. Normal distribution
A normal distribution is a special type of distribution with some interesting attributes. When samples are normally distributed, analyzing their variation is often much easier.
A normal distribution is symmetrical with a single peak, which means that its mean, median, and mode are all the same.
4. Sample distribution
For example, let's look at summary stats taken from our UN dataset of country-level metrics. This dataset, which we have previously encountered in the exercises, contains multiple sociological and economic metrics, like GDP and life expectancy, for countries around the world.
Here is a density plot of life expectancy per country. The peak is off-center and the asymmetry indicates a non-normal distribution.
5. Accessing summary stats
We can easily access various summary stats from a pandas DataFrame. To retrieve the mean, median, or mode of a column, we simply call the dot mean, dot median, or dot mode method. Where multiple modes are present, dot mode will return multiple values. Looking at these values, we can compare the results. The more skewed the distribution, the more these values will tend to differ from one another.
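As a rough sketch of what this looks like in code, the example below builds a tiny DataFrame and calls the three methods. The column name life_exp and the six values are assumptions made for illustration, not the actual UN dataset.

```python
import pandas as pd

# Hypothetical stand-in for the country-level data; the column name is assumed.
countries = pd.DataFrame({"life_exp": [55.0, 70.0, 70.0, 78.0, 78.0, 81.0]})

mean_life = countries["life_exp"].mean()
median_life = countries["life_exp"].median()
# .mode() returns a Series, because a sample can have more than one mode.
modes_life = countries["life_exp"].mode()

print(mean_life, median_life, list(modes_life))
```

Here the mean and median differ, and dot mode returns two values because 70 and 78 are tied as the most frequent.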
6. Normal distribution
Another feature of the normal distribution is how the values spread around the mean. With a normal distribution, about 68 percent of the values, roughly two thirds, are within one standard deviation of the mean, and about 95 percent are within two standard deviations. This gives the characteristic "bell curve" shape.
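These proportions can be checked directly from the normal cumulative distribution function. The sketch below uses NormalDist from Python's standard library rather than anything specific to this course.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within one and within two standard deviations of the mean.
within_one = std_normal.cdf(1) - std_normal.cdf(-1)
within_two = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"{within_one:.3f}, {within_two:.3f}")  # prints "0.683, 0.954"
```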
7. Q-Q (quantile-quantile) plot
We can use this property of the normal distribution to examine whether or not a particular sample behaves as we would expect a normally distributed sample to behave. A normal probability plot, a type of quantile-quantile or Q-Q plot, compares the quantiles observed in the data with those expected under a normal distribution. This provides us with a graphical method to assess normality. When the observed values correspond to the expected quantiles, the points of the Q-Q plot fall along a straight line, meaning that the sample is consistent with a normal distribution. Here, you can see an example of a Q-Q plot for a perfect normal distribution.
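To show what a Q-Q plot is comparing, here is a hand-rolled sketch that pairs each sorted observation with its expected normal quantile. The helper name normal_qq_points and the plotting positions (i + 0.5) / n are assumptions for illustration; libraries such as scipy use a slightly different convention.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair expected standard normal quantiles with the sorted observations."""
    n = len(sample)
    observed = sorted(sample)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n are one common convention (assumed here).
    expected = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))

# Build a sample that sits exactly on normal quantiles: mean 70, standard deviation 5.
z_scores = [NormalDist().inv_cdf((i + 0.5) / 9) for i in range(9)]
perfect_sample = [70 + 5 * z for z in z_scores]
points = normal_qq_points(perfect_sample)

# On a straight line, every point satisfies observed = 70 + 5 * expected.
max_gap = max(abs(obs - (70 + 5 * exp)) for exp, obs in points)
print(max_gap)
```

For this constructed sample the gap is zero up to rounding error, which is the straight-line case. Real samples will scatter around the line even when they are close to normal.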
8. Creating a Q-Q plot
Let's make a Q-Q plot for the data we examined, life expectancy values for our country-level demographic data. We start by importing scipy dot stats and plotnine. Then, we'll use the probplot function on the life expectancy values from our dataset, specifying "norm" for the dist argument to indicate the quantiles for a normal distribution. We then take the theoretical quantiles, found at the zero zero index of that result, along with the sorted life expectancy values, to create a DataFrame. Finally, we pass this DataFrame to the ggplot function and add geom underscore point.
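The steps just described might look something like the sketch below. The life expectancy values are randomly generated placeholders rather than the real UN data, and the names life_exp, qq_df, and make_qq_plot are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical, left-skewed placeholder values; NOT the real UN life expectancy data.
rng = np.random.default_rng(0)
life_exp = 85 - rng.exponential(scale=8, size=200)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
result = stats.probplot(life_exp, dist="norm")
theoretical = result[0][0]

# Pair the theoretical quantiles with the sorted observed values.
qq_df = pd.DataFrame({
    "theoretical": theoretical,
    "observed": np.sort(life_exp),
})

def make_qq_plot(df):
    """Build the Q-Q scatter plot; plotnine is imported here so it is only needed for plotting."""
    from plotnine import ggplot, aes, geom_point
    return ggplot(df, aes(x="theoretical", y="observed")) + geom_point()
```

Calling make_qq_plot(qq_df) returns a plotnine plot object, which displays when it is the last expression in a notebook cell.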
9. Q-Q plot for sample
Here are the distribution plot and the resulting Q-Q plot. The pronounced curve in the Q-Q plot tells us that our sample is not normally distributed. The further the points stray from a straight line, the further the sample is from a normal distribution.
10. Let's practice!
Now it's your turn to explore normal distributions.