Box plots and distribution characteristics

1. Box plots and distribution characteristics

There's an alternative to histograms for drawing the distribution of a continuous variable: the box plot.

2. Box plots

The box plot is created by plotting all individual data points along the axis. Each circle represents a data point or observation.

3. Box plots

Then, the median is calculated as the midpoint of your data: fifty percent of the data points are below this line and fifty percent are above. The median is also known as the second quartile, since it marks the end of the second quarter of your sorted data.

4. Box plots

The box is formed by the first and third quartiles, also called the lower and upper quartiles, respectively. The lowest twenty-five percent of your data points are below the lower quartile. The highest twenty-five percent of your data points are above the upper quartile. The distance between these two lines is called the interquartile range, or IQR. Consequently, fifty percent of your data points lie within this box.
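
The quartiles and the IQR described above can be sketched in a few lines of Python. This is only an illustration: the sample values are hypothetical, and the "inclusive" interpolation method is an assumption, since different tools compute quartiles in slightly different ways and may return slightly different values.

```python
import statistics

# Hypothetical sample, not taken from the course dataset.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# Split the sorted data into four quarters. The "inclusive" method is an
# assumption; other quartile conventions can give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The interquartile range is the width of the box.
iqr = q3 - q1

print("lower quartile:", q1)
print("median (second quartile):", q2)
print("upper quartile:", q3)
print("IQR:", iqr)
```

The second quartile returned here is the same value as `statistics.median(data)`, which matches the description of the median as the second quartile.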

5. Box plots

Lastly, the horizontal lines, known as the whiskers, extend at most one and a half times the IQR from each edge of the box by default. They are shorter, however, when the minimum or maximum value lies within that range, in which case the whisker ends at that value. Every data point outside the whiskers is considered an extreme value, or outlier. In this example, there is just one outlier, the maximum value of this dataset.
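
The whisker rule can be written out as a small calculation. The sketch below assumes the default factor of one and a half times the IQR and uses a hypothetical sample, chosen so that only the maximum value falls outside the whiskers. The names `lower_fence` and `upper_fence` are illustrative, not terms from the course.

```python
import statistics

# Hypothetical sample, not taken from the course dataset.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The furthest a whisker may reach: 1.5 * IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker ends at the most extreme data point that is still inside the
# fence, so it is shorter than 1.5 * IQR when the data stops earlier.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)

# Everything beyond the fences is treated as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("whiskers end at:", lower_whisker, "and", upper_whisker)
print("outliers:", outliers)
```

In this sample the lower whisker stops at the minimum value because the minimum lies well inside the allowed range.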

6. When to use a box plot

The power of box plots lies in the fact that you can compare multiple distributions simultaneously. Here we have the sales figures for eight manufacturers of office supplies. Comparing these using histograms would require plotting eight separate histograms, aligning them, and then trying to differentiate between the categories. Box plots for all of the manufacturers can be drawn in a single graph, allowing you to spot differences and trends more easily. You can see, for example, that Canon, Hon, HP, and Polycom have unusually high sales figures outside the upper whisker range. Wasp appears as a single line because there is only one order with one sale for that manufacturer.
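
To see why a category with a single observation collapses into a line, it can help to compute the numbers behind each box. The sketch below uses made-up categories and sales values, not the manufacturers from the example, and the helper `five_number_summary` is a hypothetical name introduced for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    if len(ordered) == 1:
        # A single observation: every statistic collapses to that one value,
        # which is why such a category is drawn as a single line.
        only = ordered[0]
        return (only, only, only, only, only)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])

# Hypothetical sales per order for three made-up categories.
sales = {
    "A": [10, 12, 15, 18, 20, 22, 90],
    "B": [30, 35, 40, 45, 50],
    "C": [25],
}

summaries = {name: five_number_summary(values) for name, values in sales.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Placing these summaries side by side is, in effect, what a grouped box plot does visually.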

7. When to use a box plot

In general, you use box plots when you want to compare distributions among multiple categories. You lose some detail compared to plotting histograms, but it makes it far easier to spot trends and differences between the categories of interest.

8. What about the mean?

What box plots don't show is the average value of the observations. Average usually refers to the arithmetic mean, the sum of the observations divided by the number of observations, so the two terms are often used interchangeably. Of course, you could add the average to a box plot, like here. In this case, the average lies between the median and the third quartile, but it can also be close to the median or lie between the first quartile and the median.
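
The difference between the mean and the median can be checked on a small example. The values below are hypothetical and were picked so that one large observation pulls the mean above the median.

```python
import statistics

# Hypothetical sample with one large value at the upper end.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# Arithmetic mean: sum of the observations divided by their count.
mean = sum(data) / len(data)
median = statistics.median(data)
_, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("mean:", round(mean, 2))
print("median:", median)
print("mean lies between median and upper quartile:", median < mean < q3)
```

Removing the largest value from the list moves the mean much closer to the median, while the median itself barely changes.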

9. Skewness

To understand why, we need to look at two characteristics of continuous distributions we haven't covered so far. The first characteristic is the so-called skewness, or asymmetry of the distribution. A right-skewed distribution has a longer tail on the right,

10. Skewness

a left-skewed distribution has a longer tail on the left. Right-skewed and left-skewed are also called positive and negative skewness, respectively.

11. Skewness

A skewness of zero means that your data is symmetrically distributed, as is the case for a normal distribution. Skewness can be visualized using histograms and box plots.

12. Skewness

Each vertical pair of histogram and box plot shows the same data, highlighting the symmetric or asymmetric nature of the data. Notice the position of the average and the median: in a right-skewed distribution, the average lies to the right of the median, and in a left-skewed distribution it lies to the left. Since many statistical models assume normally distributed, and therefore symmetric, data, recognizing skewness is an essential first step when performing EDA.
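
The sign of the skewness and the relative position of mean and median can also be checked numerically. The sketch below uses one common definition, the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). Other tools may apply a sample-size correction, so treat the exact values as illustrative. All samples are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no sample-size correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 15]   # long tail on the right
left_skewed = [-x for x in right_skewed]             # mirror image
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, sample in [("right", right_skewed),
                     ("left", left_skewed),
                     ("symmetric", symmetric)]:
    mean = sum(sample) / len(sample)
    median = statistics.median(sample)
    print(name, "skewness:", round(skewness(sample), 3),
          "mean:", round(mean, 2), "median:", median)
```

The right-skewed sample gives a positive skewness with the mean above the median, the mirrored sample gives the opposite, and the symmetric sample gives a skewness of zero.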

13. Excess kurtosis

The second characteristic is the excess kurtosis, which describes how heavy the tails of the distribution are, that is, how many extreme values there are. Leptokurtic, or positive excess kurtosis, refers to lots of extreme values, resulting in a narrow peak. Platykurtic, or negative excess kurtosis, means the opposite: few extreme values, resulting in a broad peak. Mesokurtic (zero excess kurtosis) has the shape of the bell curve of a normal distribution. Again, both histograms and box plots can be used to show the amount of excess kurtosis, which allows you to compare the number of outliers in your dataset.
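
Excess kurtosis can be computed in the same spirit. The sketch below assumes the moment-based definition (fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores zero) without any sample-size correction. Both samples are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no sample-size correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical samples.
# Most values sit at the center, with two extreme values far away.
heavy_tailed = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]
# Values spread evenly, with no pronounced peak and no extreme values.
flat = list(range(1, 11))

print("heavy-tailed sample:", round(excess_kurtosis(heavy_tailed), 3))
print("flat sample:", round(excess_kurtosis(flat), 3))
```

The first sample yields a positive value (leptokurtic) and the second a negative value (platykurtic), in line with the descriptions above.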

14. Let's practice!

Let's recap and then create some box plots in Tableau!