Box plots

1. Box plots

By now you've had quite a bit of experience using box plots to visualize the distribution of numerical data, but let's dig deeper to understand exactly how they are constructed, starting with a dot plot.

The box plot is based around three summary statistics: the first quartile of the data, the second quartile, and the third quartile. You might be more familiar with the second quartile as the median, the value that's in the middle of the dataset. It's the second quartile because two quarters, or half, of the data is below it, and half is above it. The first quartile, then, has only one quarter of the data below it, and the third quartile has three quarters of the data below it.
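To make the three quartiles concrete, here is a minimal sketch in Python using only the standard library. The data are a hypothetical toy sample, and the interpolation convention (`method="inclusive"`) is an assumption chosen for illustration; other software may use a slightly different quartile convention and give slightly different numbers.

```python
import statistics

# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
# method="inclusive" interpolates linearly between neighbouring observations
# (an assumed convention; other conventions exist).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(q1, median, q3)  # 3.0 5.0 7.0
```

For this sample the box would run from 3 to 7, with the median line at 5.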

These three numbers form the box in the box plot, with the median in the middle and the first and third quartiles as the edges. One thing you always know when looking at a box plot is that the middle half of the data is inside this box.

There are various rules for where to draw the whiskers, the lines that extend out from the box. The one used by ggplot2 is to draw each whisker out 1.5 times the length of the box, then pull it back in to the first observation that is encountered. The particular rule is less important than the interpretation, which is that the whiskers should encompass nearly all of the data.

Any data that is not encompassed by either the box or the whiskers is represented by a point. This is one of the handy features of a box plot: it flags for you points that are far away from the bulk of the data, a form of automated outlier detection.
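The whisker rule and the outlier flagging can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ggplot2's actual code: the data are hypothetical, the quartile convention is the standard library's `method="inclusive"`, and the names `lower_limit` and `upper_limit` are made up for this example.

```python
import statistics

# Hypothetical toy data with one unusually large value.
values = [2, 4, 5, 5, 6, 7, 8, 9, 30]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
box_length = q3 - q1  # distance between the edges of the box

# Farthest a whisker is allowed to reach from each edge of the box.
lower_limit = q1 - 1.5 * box_length
upper_limit = q3 + 1.5 * box_length

# Each whisker is pulled back to the most extreme observation inside its limit.
inside = [v for v in values if lower_limit <= v <= upper_limit]
lower_whisker = min(inside)
upper_whisker = max(inside)

# Anything beyond the limits would be drawn as an individual point.
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(lower_whisker, upper_whisker, outliers)  # 2 9 [30]
```

Here the box spans 5 to 8, the whiskers reach 2 and 9, and the value 30 is flagged as a point on its own.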

14. Side-by-side box plots

Let's revisit the side-by-side box plots that you constructed in your exercise. This shows the distribution of city mileage broken down by cars that have 4 cylinders, 6 cylinders, and 8 cylinders. We can look to the heavy line in the boxes and learn that the median mileage is greatest for 4 cylinders and lower for 6 cylinders. For 8 cylinder cars, something odd is going on: the median is very close to the third quartile. In terms of variability, the 4 cylinder cars again stand out, with the widest box and the whiskers that extend the farthest. The middle half of the data in 6 cylinder cars spans a very small range of values, shown by the narrow box. Finally, we see some outliers: one 6 cylinder car with low mileage and several 4 cylinder cars with high mileage.

15. Side-by-side box plots

If you're wondering about that highest outlier in the 4 cylinder category, that is indeed a hybrid vehicle.

16. Side-by-side box plots

Notice that in terms of syntax, ggplot actually expects you to be plotting several box plots side-by-side. If you want to see just a single one, you can just set the x argument to 1. Box plots really shine in situations where you need to compare several distributions at once, and also as a means to detect outliers. One of their weaknesses, though, is that they have no capacity to indicate when a distribution has more than one hump or "mode".

Consider the density plot here: there are two distinct modes. If we construct a box plot of the same distribution, it sweeps this important structure under the rug and will always only provide a single box.
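A small numeric sketch shows why the box hides the two humps. The data below are hypothetical and deliberately exaggerated, and the quartile convention (`method="inclusive"` from Python's standard library) is an assumption for illustration only.

```python
import statistics

# Hypothetical data with two clearly separated humps, around 2 and around 10.
low_hump = [1, 2, 2, 2, 3]
high_hump = [9, 10, 10, 10, 11]
values = low_hump + high_hump

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Observations that fall in the gap between the two humps.
in_gap = [v for v in values if 3 < v < 9]

print(q1, median, q3)  # 2.0 6.0 10.0
print(len(in_gap))     # 0
```

The box would stretch from 2 to 10 with its median line at 6, a region that contains no observations at all, so nothing in the box itself signals that the data fall into two separate groups.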

19. Let's practice!

Now that you know a bit more about how box plots are constructed, it's time for you to construct some yourself.