
Percentiles, outliers, and box plots

1. Percentiles, outliers, and box plots

The median is a special name for the 50th percentile;

2. Percentiles on an ECDF

that is 50% of

3. Percentiles on an ECDF

the data are less than the median. Similarly, the 25th percentile

4. Percentiles on an ECDF

is the value of the data point that is greater than 25% of the sorted data, and so on for any

5. Percentiles on an ECDF

other percentile we want. Percentiles are useful summary statistics, and can be computed

6. Computing percentiles

using np.percentile(). We just pass the data and a list of the percentiles we want (percentiles on a 0 to 100 scale, not fractions), and it returns the values in the data that correspond to those percentiles. We can do this for all of the swing states. Let's compute the 25th, 50th, and 75th percentiles. We now have three summary statistics. Now the whole point of summary statistics was to keep things concise, but we're starting to get a lot of numbers here. Dealing with this issue is where quantitative EDA meets graphical EDA.
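As a minimal sketch of the call described above: the array name dem_share and its values are made up for illustration and are not the real election data. The block only shows the shape of the np.percentile() call and that the 50th percentile agrees with the median.

```python
import numpy as np

# Hypothetical vote shares in percent (made-up values, NOT the real election data)
dem_share = np.array([38.2, 41.5, 44.9, 47.3, 50.1, 52.8, 55.4, 58.0, 61.7])

# Percentiles are given on a 0-100 scale, not as fractions
percentiles = np.percentile(dem_share, [25, 50, 75])

# One value comes back per requested percentile, in the same order
print(percentiles)

# The 50th percentile is the median
print(np.median(dem_share))
```

With nine sorted values, the requested percentiles land exactly on the 3rd, 5th, and 7th data points, so no interpolation between neighbors is needed in this particular example.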

7. 2008 US election box plot

Box plots were invented by John Tukey himself to display some of the salient features of a data set based on percentiles. Here, we see a box plot showing Obama's vote share from states east and west of the Mississippi River. The center of the box is the median,

8. 2008 US election box plot

which we know is the 50th percentile of the data. The edges of the boxes

9. 2008 US election box plot

are the 25th and 75th percentiles. The total height of the box spans the middle 50% of the data, and is called

10. 2008 US election box plot

the interquartile range, or IQR. The whiskers extend a distance

11. 2008 US election box plot

of 1.5 times the IQR, or to the extent of the data, whichever is less extreme. Finally, all points outside of the whiskers are plotted

12. 2008 US election box plot

as individual points, which we often demarcate as outliers. While there is no single definition for an outlier, being more than 2 IQRs away from the median is a common criterion, which matches the whisker rule when the median sits in the middle of the box. It is important to remember that an outlier is not necessarily an erroneous data point. You should not assume an outlier is erroneous unless you have some reason to. Since there is zero evidence of any substantial voter fraud in the United States, these outliers are not erroneous. They are just data points with extreme values. When the number of data points is very large and bee swarm plots are too cluttered, box plots are a great alternative. It makes sense, then, that constructing a box plot

13. Generating a box plot

using Seaborn is exactly the same as making a bee swarm plot; we just use sns.boxplot(). And of course we never forget to label the axes.
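The pieces of the box plot described above can be sketched as follows. The DataFrame, its column names (east_west, dem_share), the helper whisker_limits(), and all values are hypothetical and chosen only for illustration. The helper applies the 1.5 times IQR whisker rule from the narration; Seaborn computes its own whiskers internally, so the helper is just a way to see the numbers behind the picture.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data (made-up values, NOT the real election results)
df = pd.DataFrame({
    "east_west": ["east"] * 7 + ["west"] * 7,
    "dem_share": [40.0, 45.0, 50.0, 52.0, 55.0, 60.0, 92.0,
                  25.0, 38.0, 42.0, 45.0, 48.0, 52.0, 56.0],
})


def whisker_limits(values):
    """Hypothetical helper: box edges extended by 1.5 times the IQR."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25  # interquartile range: the height of the box
    return q25 - 1.5 * iqr, q75 + 1.5 * iqr


east = df.loc[df["east_west"] == "east", "dem_share"].to_numpy()
low, high = whisker_limits(east)

# Points beyond the limits would be drawn individually as outliers
outliers = east[(east < low) | (east > high)]
print(low, high, outliers)

# Same call pattern as a bee swarm plot, with boxplot instead
ax = sns.boxplot(x="east_west", y="dem_share", data=df)

# Never forget to label the axes
ax.set_xlabel("region")
ax.set_ylabel("percent of vote for Obama")
```

In an interactive session, calling matplotlib's plt.show() afterwards displays the figure. In this made-up data the value 92.0 lies above the upper limit of 72.5, so it would appear as an individual point rather than inside the whisker.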

14. Let's practice!

All right, let's go have some fun computing percentiles and making box plots!
