1. Intro to comparing distributions
In the last chapter, we learned all about visualizing distributions by themselves. However, what if we're interested not just in the shape of a given distribution, but how it compares to other distributions?
2. Why compare distributions?
Why would we want to do this?
There are many possible reasons, and they change with the context you are doing data science in.
You may be constructing a model that compares the sales performance of two stores, and you want to make sure the populations the stores serve are balanced on key covariates.
You could also be interested in characterizing the differences as the end goal itself, as we will do with our speeding data.
3. Why not facet histograms?
Why don't we just do a histogram faceted on the variable of interest?
In many scenarios, this is a great way of proceeding, but one downside often makes it less than ideal: faceted histograms or KDEs are not space-efficient.
This comes back to bite us in two main scenarios. The first is when we have lots of groups we want to compare. Soon the plot gets clogged with repeated axes and cut-off facet labels, and you have to make the visualization huge to fit it all.
Second, we might want to display summary statistics, like the median and quantiles of each distribution, along with the distribution itself. Again, this gets cluttered when your plot is anything other than large.
4. The boxplot
The most common distribution comparison visualization is the boxplot.
The boxplot is a simple construction. It consists of a box, with opposing ends falling at the 25th and 75th percentiles of the data respectively. The width of the box is therefore the interquartile range, or IQR. A line is usually drawn across the box to indicate the median of the data, and lines known as whiskers extend from each end of the box to the most extreme data point that lies within 1.5 times the IQR of that end. Finally, points falling outside the reach of the whiskers are plotted individually and are considered 'outliers'.
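To make the construction concrete, here is a minimal sketch in base R of the numbers a boxplot summarizes. The `speeds` vector is made-up illustration data, not the speeding data used in this course, and the 1.5 times IQR whisker rule is the common Tukey convention, assumed here.

```r
# Hypothetical data for illustration only (not the course's speeding data)
speeds <- c(3, 5, 5, 6, 7, 8, 9, 10, 12, 30)

# Ends of the box (25th and 75th percentiles) and the median line
q <- quantile(speeds, c(0.25, 0.5, 0.75), names = FALSE)
box_low    <- q[1]
box_median <- q[2]
box_high   <- q[3]

# Width of the box: the interquartile range
iqr <- box_high - box_low

# Whiskers may reach at most 1.5 * IQR beyond each end of the box
lower_fence <- box_low - 1.5 * iqr
upper_fence <- box_high + 1.5 * iqr

# Each whisker stops at the most extreme data point inside its fence
inside <- speeds[speeds >= lower_fence & speeds <= upper_fence]
whisker_low  <- min(inside)
whisker_high <- max(inside)

# Anything beyond the fences is drawn as an individual point
outliers <- speeds[speeds < lower_fence | speeds > upper_fence]
```

Printing `outliers` shows which observations would be plotted individually. Everything else is folded into the box and whiskers.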
5. Boxplot pros
What are the pros of the boxplot?
As with many plots we've discussed, most people know what a boxplot is and how to interpret it.
This familiarity helps boxplots be efficient. There is usually no need to explicitly label the different measures the boxplot shows because readers are already familiar with them. You should be careful not to assume that every audience will know these values, but for more technically minded viewers you can usually get away with it.
6. Boxplot cons
You can probably already guess the cons based on the previous chapter's discussion of KDE plots... We can't see the data.
A potentially huge number of data points, and the corresponding nuances in the data, can be hidden within the box of a boxplot. Take this example. These two datasets have exactly the same boxplot, but the story is a bit different when we look at the raw points!
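The same effect can be reproduced with a small sketch. The two vectors below are invented for illustration and are not the datasets shown on the slide. They share the same minimum, quartiles, median, and maximum, so their boxplots would be identical, yet the raw values are arranged very differently.

```r
# Two hypothetical datasets, invented for illustration only
evenly_spread <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
clumped       <- c(1, 3, 3, 3, 5, 7, 7, 7, 9)

# The five numbers a boxplot is built from: min, Q1, median, Q3, max
probs <- c(0, 0.25, 0.5, 0.75, 1)
summary_spread  <- quantile(evenly_spread, probs, names = FALSE)
summary_clumped <- quantile(clumped, probs, names = FALSE)

# TRUE: the boxplot summaries match exactly
same_boxplot <- isTRUE(all.equal(summary_spread, summary_clumped))

# Counting the raw values exposes the difference: the second dataset
# piles up on 3 and 7, which the box alone cannot show
counts_spread  <- table(evenly_spread)
counts_clumped <- table(clumped)
```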
7. A simple addition
Luckily there is a super simple addition you can make to your boxplot code that will drastically improve your visualization.
By adding the geometry geom_jitter before your boxplot geometry, ggplot will draw all the raw data points used to construct the boxplot, slightly jostled to avoid overlapping as much as possible.
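A minimal sketch of that layering, assuming ggplot2 is installed. The `speeding` data frame, its `group` and `speed` columns, and the styling values are hypothetical stand-ins, not the course's actual data or code.

```r
library(ggplot2)

set.seed(42)

# Hypothetical stand-in data: three groups with different spreads
speeding <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  speed = c(rnorm(100, 12, 3), rnorm(100, 15, 5), rnorm(100, 10, 2))
)

p <- ggplot(speeding, aes(x = group, y = speed)) +
  # Raw points go first so the boxplot is drawn on top of them.
  # height = 0 jostles points sideways only, leaving speeds unchanged.
  geom_jitter(width = 0.2, height = 0, alpha = 0.3, color = "steelblue") +
  # alpha = 0 makes the box fill transparent so the points stay visible;
  # outlier.shape = NA avoids drawing outliers a second time
  geom_boxplot(alpha = 0, outlier.shape = NA)
```

Printing `p` renders the plot. Hiding the boxplot's own outlier points is a design choice made here because the jitter layer already shows every observation.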
8. How to make jittered points
Instantly we can see some characteristics of this data that we are already familiar with. There is a strong digit preference at the lower speeds that slowly dissipates as the speeds get higher.
9. Let's compare some distributions!
Now that you know the benefits and drawbacks of boxplots, and a quick way to improve them, let's do a few examples!