1. Shape and transformations
There are generally four characteristics of distributions that are of interest. The first two we've covered already: the center and the spread or variability of the distribution. The third is the shape of the distribution, which can be described in terms of the modality and the skew.
2. Modality
The modality of a distribution is the number of prominent humps that show up in the distribution. If there is a single mode, as in a bell curve, it's called unimodal. If there are two prominent modes, it's called bimodal. If it has three modes or more, the convention is to refer to it as multimodal. There is one last case: when there is no distinct mode. Because the distribution is flat across all values, it is referred to as uniform.
The other aspect of the shape of the distribution concerns its skew.
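To make "prominent humps" a little more concrete, here is a minimal sketch that counts peaks in a list of histogram bin counts. The function name `count_modes` and the `min_fraction` cutoff are assumptions made for illustration, not a standard definition; in practice modality is usually judged by eye from a plot.

```python
def count_modes(counts, min_fraction=0.25):
    """Count prominent peaks in a list of histogram bin counts.

    A bin counts as a peak if it is higher than both neighbours and at
    least `min_fraction` of the tallest bin. The cutoff is an arbitrary,
    hypothetical choice used to ignore small bumps.
    """
    if max(counts) == min(counts):
        return 0  # perfectly flat: no distinct mode (uniform)

    # Collapse runs of equal counts so a flat-topped hump is counted once.
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    padded = [0] + collapsed + [0]
    threshold = min_fraction * max(counts)

    peaks = 0
    for i in range(1, len(padded) - 1):
        higher_than_neighbours = padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
        if higher_than_neighbours and padded[i] >= threshold:
            peaks += 1
    return peaks


# Made-up bin counts for each of the four cases.
unimodal = [1, 3, 7, 12, 7, 3, 1]
bimodal = [2, 9, 4, 1, 5, 10, 3]
small_bump = [1, 6, 12, 6, 2, 1, 2, 1]  # the late bump is too small to count
uniform = [5, 5, 5, 5]

for name, counts in [("unimodal", unimodal), ("bimodal", bimodal),
                     ("small_bump", small_bump), ("uniform", uniform)]:
    print(name, count_modes(counts))
```

The cutoff is what separates a real mode from noise, so the answer depends on it and on the bin width.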
3. Skew
If a distribution has a long tail that stretches out to the right, it's referred to as "right-skewed".
4. Skew
If that long tail stretches out to the left, it's referred to as "left-skewed". If you have trouble remembering which is which, just remember that the skew is where the long tail is.
5. Skew
If neither tail is longer than the other, the distribution is called "symmetric".
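One common numerical companion to these visual definitions is the moment coefficient of skewness, which is positive when the long tail is on the right, negative when it is on the left, and zero for perfectly symmetric data. The sketch below is an illustration using made-up numbers; the function name `sample_skewness` is an assumption, and this coefficient is not something the lesson itself computes.

```python
from statistics import fmean


def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments).

    Assumes the values are not all identical, otherwise m2 is zero.
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up data: one large value on the right creates a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirroring the data moves the long tail to the left.
left_skewed = [-x for x in right_skewed]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_skewed))  # positive
print(sample_skewness(left_skewed))   # negative
print(sample_skewness(symmetric))     # zero, up to floating-point error
```

The sign follows the same rule of thumb as the plot: it points toward the side with the long tail.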
6. Shape of income
Let's compare the distributions of median personal income at the county level on the West Coast and in the rest of the country to see what shape these distributions take. There are several plot types that we could use here. Let's use an overlaid density plot by putting income along the x axis, filling the two curves with color according to whether or not the county is on the West Coast, then adding a density layer and specifying an alpha level of 0.3. This allows the colors to be somewhat transparent so that we can see where they overlap.
The plot that results shows two curves, the blue representing the West Coast distribution and the pink representing counties not on the West Coast. Each distribution has a single prominent mode, so we can say that each distribution is unimodal. You might argue that the little bump around 100,000 dollars is a second mode, but we're generally looking for larger-scale structure than that.
It's difficult to compare these distributions because they are both heavily right-skewed, that is, there are a few counties in each group that have very high incomes. One way to remedy this is to construct a plot of a transformed version of this variable.
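To show roughly what a density layer computes for each group, here is a minimal sketch of a Gaussian kernel density estimate evaluated on a shared grid, which is what lets two curves be overlaid on one axis. The income values (in thousands of dollars), the bandwidth, and the function name `gaussian_kde` are all made up for illustration; this is not the county data or the plotting code used in the lesson.

```python
from statistics import NormalDist


def gaussian_kde(sample, bandwidth):
    """Return a density function: the average of one Gaussian bump per data point."""
    kernels = [NormalDist(mu=x, sigma=bandwidth) for x in sample]

    def density(t):
        return sum(k.pdf(t) for k in kernels) / len(kernels)

    return density


# Hypothetical median incomes in thousands of dollars, each with one high value.
west_coast = [28, 31, 33, 35, 38, 41, 45, 52, 100]
elsewhere = [18, 20, 22, 23, 25, 26, 28, 31, 36, 60]

bandwidth = 5.0                          # assumed smoothing width
step = 0.5
grid = [i * step for i in range(261)]    # shared x axis: 0 to 130

west_density = gaussian_kde(west_coast, bandwidth)
else_density = gaussian_kde(elsewhere, bandwidth)
west_curve = [west_density(t) for t in grid]
else_curve = [else_density(t) for t in grid]

# Each curve encloses an area of about 1, so groups of different sizes are comparable.
print(sum(west_curve) * step, sum(else_curve) * step)

# The location of each curve's highest point is its (single prominent) mode.
west_mode = grid[west_curve.index(max(west_curve))]
else_mode = grid[else_curve.index(max(else_curve))]
print(west_mode, else_mode)
```

A larger bandwidth smooths away small bumps, while a smaller one makes them look like extra modes, so the apparent modality depends partly on this choice.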
7. Shape of income
Since income has a heavy right skew, either a square-root or a log transformation will do a good job of drawing in the right tail and spreading out the lower values so that we can see what's going on. We can perform the transformation by wrapping income in the log function, which will take the natural log. The result is a picture that's a bit easier to interpret: the typical income in West Coast counties is indeed greater than that in the rest of the country, and the second, very small mode of high-income counties on the West Coast is not present in the other distribution.
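As a rough check on the claim that these transformations draw in the right tail, the sketch below compares a skewness coefficient before and after transforming a set of made-up incomes. The values and the helper name `skewness` are assumptions for illustration only; they are not the county data from the lesson.

```python
import math
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes in dollars with one very high value in the right tail.
incomes = [22000, 25000, 27000, 30000, 31000, 33000,
           36000, 40000, 48000, 60000, 110000]

raw_skew = skewness(incomes)
sqrt_skew = skewness([math.sqrt(x) for x in incomes])
log_skew = skewness([math.log(x) for x in incomes])  # math.log is the natural log

print(raw_skew, sqrt_skew, log_skew)
```

For this made-up sample both transformations reduce the skew, with the log pulling the tail in more strongly than the square root. Both require care with the data's range: the log is undefined at zero and below, and the square root is undefined for negative values.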
8. Let's practice!
Let's turn to some exercises to explore the shape of the Gapminder data.