
Comparing spatially-related distributions

1. Comparing spatially-related distributions

You can get very far in comparing distributions with the combination of beeswarm plots and violin plots, but one scenario where they can get cumbersome is when you want to compare distributions that have a spatial connection.

2. What are 'spatially connected axes'?

The term 'spatially connected axes' is just a fancy way of saying that there is an ordering of some sort in the categorical axis of your distribution data. This is sometimes referred to as 'ordinal.' One example of this type of data is anything longitudinal. You may have blood pressure for patients in different months. Here there is a natural ordering to the category of the month of measurement. In the last lesson, we compared distributions whose categories had no particular order. Red cars are not closer to blue cars than they are to green cars.

3. The ridgeline plot

The ridgeline plot is a visualization technique designed for these types of scenarios. It is simply a bunch of standard kernel density plots (or sometimes histograms) stacked closely on top of each other, allowing the viewer to see distribution-level patterns in their data over the ordinal axis range. With the ggridges package by Claus Wilke, making one is as easy as adding the geometry geom_density_ridges to our plot.
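As a minimal sketch of what that looks like, the example below builds a ridgeline plot from simulated blood pressure readings. The data frame `bp`, its columns `month` and `systolic`, and all the numbers in it are made up for illustration; only `geom_density_ridges()` comes from the lesson.

```r
library(ggplot2)
library(ggridges)

# Simulated (made-up) data: 100 blood pressure readings per month,
# with the mean drifting upward over six months.
set.seed(42)
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
bp <- data.frame(
  # Storing month as a factor with explicit levels keeps the ordinal order
  month = factor(rep(months, each = 100), levels = months),
  systolic = rnorm(
    600,
    mean = rep(seq(120, 130, length.out = 6), each = 100),
    sd = 8
  )
)

# One density per month, stacked along the ordered y axis
ridge_plot <- ggplot(bp, aes(x = systolic, y = month)) +
  geom_density_ridges()

print(ridge_plot)
```

Note that the ordered variable goes on the y axis and the measured value on the x axis, so each month gets its own baseline.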

4. Ridgeline pros

A main plus of ridgeline plots is their ability to convey shifts in distribution over the ordinal axis. By taking advantage of the correlation between neighboring categories on that axis, they pack the plots closer together than they would be if you just faceted by the ordinal variable. Another plus is that they are still just standard KDEs, so you can easily focus on a single distribution and gain all the insight you would otherwise have from another form of visualization.

5. Ridgeline cons

The main con of ridgeline plots is linked to their main pro: the close proximity of each distribution's plot, which can cause overlap. This overlap is dangerous, as you may miss some unique feature in a distribution simply because it was covered up by a neighboring distribution. Another con is the fact that we have so many different KDEs at once. We saw how tricky it can be to fiddle with the kernel width of a single density plot; now multiply that by 10 or 20 separate distributions and you are likely to have to make some compromises. As with any plot or analysis, if you see a surprising or interesting pattern, dig into it to make sure you're not seeing a spurious relationship caused by the nuances of the method!
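Both issues can be explored through arguments to `geom_density_ridges()`. The sketch below, again on made-up data (the `readings` data frame and its columns are hypothetical), uses `scale` to control how much neighboring ridges overlap and `bandwidth` to set one kernel width shared by every density. The specific values are arbitrary starting points, not recommendations.

```r
library(ggplot2)
library(ggridges)

# Simulated (made-up) data: 10 ordered visits, 80 readings each
set.seed(1)
visits <- paste("Visit", 1:10)
readings <- data.frame(
  visit = factor(rep(visits, each = 80), levels = visits),
  value = rnorm(800, mean = rep(1:10, each = 80), sd = 1.5)
)

# scale < 1 keeps each ridge below the baseline above it, so nothing is hidden
separated <- ggplot(readings, aes(x = value, y = visit)) +
  geom_density_ridges(scale = 0.9, bandwidth = 0.5)

# scale > 1 lets ridges overlap; a semi-transparent fill helps reveal
# features that would otherwise be covered by a neighbor
overlapping <- ggplot(readings, aes(x = value, y = visit)) +
  geom_density_ridges(scale = 3, bandwidth = 0.5, alpha = 0.5)

print(separated)
print(overlapping)
```

Comparing the two versions side by side is one way to check that a pattern is not an artifact of overlap or of the chosen kernel width.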

6. Overview of distribution visualization

Now that we've covered visualizing distributions over two chapters, it pays to briefly review what we discussed. We have two main types of distribution visualization: single distributions and comparing distributions (or conditional distributions). The main plot type for single distribution data is the histogram, but due to potential difficulties with bin placement and bin number, it's often a good idea to use a KDE plot with a rug plot to show individual data points. For comparing distributions, the main plot type is the box plot, but this plot is severely limited in its default form and should, when the data are small, be combined with jittered points or a beeswarm plot. When you have too many data points to do that, switching to the KDE-based violin plot is ideal. Lastly, if you want to take advantage of data with an ordinal axis, consider the ridgeline plot to view distribution-level patterns over spatially connected axes.
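For reference, here is a rough sketch of how those plot types map onto ggplot2 geometries. It uses the `mpg` dataset that ships with ggplot2 purely as a stand-in; the choice of dataset, columns, and argument values is an assumption for illustration. The beeswarm variant is left out because it needs an additional package.

```r
library(ggplot2)

# Single distribution: histogram, or a KDE with a rug of individual points
hist_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

kde_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_density() +
  geom_rug(alpha = 0.3)

# Comparing distributions, small data: box plot with jittered points on top.
# outlier.shape = NA hides the box plot's own outlier marks so that
# those points are not drawn twice.
box_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5)

# Comparing distributions, larger data: KDE-based violin plot
violin_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin()

print(hist_plot)
print(kde_plot)
print(box_plot)
print(violin_plot)
```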

7. Let's make some ridgelines!

Now that we've covered most distribution visualization scenarios, let's finish off by practicing with some ridgeline plots.