
Comparing groups

1. Comparing groups

In this lesson, we will look at how to compare groups.

2. What does this mean?

What exactly does comparing groups mean? For instance, say you wanted to see how pollution in August compares to pollution in the other months of the year. Does August generally have higher pollution values than the other months? Is the distribution of values wider or narrower? These comparisons can help shed light on patterns and are crucial for accurately representing your data.

3. Comparing a couple classes

Say you want to compare just two classes, for instance, pollution values for Denver compared to the rest of the cities in our dataset. If you have a continuous measure, as we do with our pollution data, a great way to compare the values is to use overlaid kernel density plots.

4. The kernel density estimator

The kernel density estimator (or KDE) plot is a kind of continuous histogram. To construct the distribution, a small 'kernel' distribution (usually a normal distribution) is centered on each data point, and these kernels are then added together. The result is a continuous estimate of the underlying density of the data. This helps you avoid comparing overlaid histograms, as you can plot a simple line instead of having the viewer guess whether histogram bars are stacked or overlaid.
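To make the construction concrete, here is a minimal sketch of a Gaussian KDE written from scratch. The observations, the bandwidth of 0.4, and the function names are made up for illustration; plotting libraries pick the bandwidth for you and are far more efficient.

```python
import math

def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(data, x, bandwidth):
    """Density estimate at x: the average of one kernel centered on each data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Made-up observations, purely for illustration.
observations = [5.5, 5.7, 6.1, 7.0, 7.2]

# Evaluate the estimate on an evenly spaced grid from 3.0 to 10.0.
step = 0.01
grid = [3.0 + step * i for i in range(701)]
density = [kde(observations, x, bandwidth=0.4) for x in grid]

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = sum(density) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows the individual points more closely.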

5. Kernel density example

Here we are again adding a column to our pollution data indicating whether the city is Denver and then feeding this modified DataFrame to Seaborn's kdeplot() function. The two curves clearly show a difference in the shapes of the two groups' distributions, with the red Denver curve shifted to the left of the rest of the cities. The KDE here has the benefit of showing the area of overlap between the distributions much better than a histogram, as overlapping lines are much easier to decipher than overlapping bars.

6. Kernel density tweak

One caveat of KDEs is that the kernels fill in ranges of the data next to the points they correspond to. If you observe a 5.5 and a 5.7 in your dataset, it's safe to assume that a 5.6 is possible. However, sometimes this filling property can imply support in areas of your plot where none exists, such as in between round numbers for data like counts that only take on integer values. A good way to be open about the true support of your data, while still providing the nice shape interpretation of a KDE, is to put small dashes on the x-axis where every data point falls. This is called a rug plot, and it is very helpful when you don't have a ton of data.

7. Comparing many classes

What if we are interested in comparing not just two classes to each other but multiple classes? The overlapped kernel density plots we just saw will get rather cluttered after two or three classes.

8. The beeswarm plot

Luckily there is another well-regarded plot that we can use. The beeswarm plot is an alternative to the standard boxplot. It involves giving each class you wish to compare a vertical (or horizontal) line; then each data point is placed on the line corresponding to where it falls on the continuous axis. Points are slightly jostled in order to get them as tightly packed as possible.

9. Beeswarm example

As a result of the small jostling to get points closer together, we end up with 'swarms' of points that give the viewer a sense of the distributional shape. Here we use Seaborn's swarmplot() function to make a beeswarm plot. You can see how the city of Fairbanks has slightly lower values than the rest, Vandenberg Air Force Base has higher values, and Houston is the most variable.
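A call along these lines could produce such a plot. The values below are simulated to loosely mimic the pattern just described, and the `O3` column name is an assumption, so treat this as a sketch rather than the lesson's exact code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated values for three of the cities named above; not real measurements.
rng = np.random.default_rng(2)
n = 40
pollution = pd.DataFrame({
    "city": np.repeat(["Fairbanks", "Vandenberg Air Force Base", "Houston"], n),
    "O3": np.concatenate([
        rng.normal(0.025, 0.006, n),  # slightly lower
        rng.normal(0.040, 0.006, n),  # higher
        rng.normal(0.032, 0.015, n),  # most variable
    ]),
})

fig, ax = plt.subplots(figsize=(8, 4))
# One swarm per city; points are nudged sideways so they do not overlap.
sns.swarmplot(data=pollution, x="city", y="O3", size=4, ax=ax)
ax.set_ylabel("O3 (simulated values)")
```

Beeswarm plots work best with a modest number of points per class; with very large groups the swarms run out of room, and a smaller `size` or a sample of the data may be needed.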

10. Let's compare!

Let's put these techniques to practice with our pollution dataset.