Percentiles, outliers, and box plots
1. Percentiles, outliers, and box plots
The median is a special name for the 50th percentile; that is, 50% of the data are less than the median.
2. Percentiles on an ECDF
Similarly, the 25th percentile is the value of the data point that is greater than 25% of the sorted data, and so on for any other percentile we want.
6. Computing percentiles
Percentiles are useful summary statistics, and can be computed using np.percentile. We just pass the data and a list of the percentiles we want (as percentages, not fractions), and it returns the values that match those percentiles. We can do this for all of the swing states. Let's compute the 25th, 50th, and 75th percentiles. We now have three summary statistics. Now the whole point of summary statistics was to keep things concise, but we're starting to get a lot of numbers here. Dealing with this issue is where quantitative EDA meets graphical EDA.
7. 2008 US election box plot
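Before looking at the plot, here is a minimal sketch of the percentile computation described above. The array name dem_share and its five values are made up for illustration; they are not real election results.

```python
import numpy as np

# Hypothetical vote shares in percent, for illustration only.
dem_share = np.array([40.0, 45.0, 50.0, 55.0, 60.0])

# Percentiles are requested on a 0-100 scale, not as fractions.
q25, q50, q75 = np.percentile(dem_share, [25, 50, 75])

print(q25, q50, q75)  # 45.0 50.0 55.0
```

The 50th percentile returned here is the same number np.median would give for this array.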
Box plots were invented by John Tukey himself to display some of the salient features of a data set based on percentiles. Here, we see a box plot showing Obama's vote share from states east and west of the Mississippi River. The line in the middle of the box is the median, which we know is the 50th percentile of the data. The edges of the boxes are the 25th and 75th percentiles. The total height of the box contains the middle 50% of the data, and is called the interquartile range, or IQR. The whiskers extend a distance of 1.5 times the IQR beyond the edges of the box, or to the extent of the data, whichever is less extreme. Finally, all points outside of the whiskers are plotted as individual points, which we often demarcate as outliers. While there is no single definition for an outlier, being more than 2 IQRs away from the median is a common criterion. It is important to remember that an outlier is not necessarily an erroneous data point. You should not assume an outlier is erroneous unless you have some reason to. Since there is zero evidence of any substantial voter fraud in the United States, these outliers are not erroneous. They are just data points with extreme values. When the number of data points is very large and bee swarm plots are too cluttered, box plots are a great alternative.
13. Generating a box plot
It makes sense, then, that constructing a box plot using Seaborn is exactly the same as making a bee swarm plot; we just use sns.boxplot. And of course we never forget to label the axes.
14. Let's practice!
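As a warm-up, the box, whisker, and outlier rules described above can be worked through numerically. This is only a sketch: the values are made up, and it assumes the convention that each whisker stops at the most extreme data point lying within 1.5 IQRs of the box edge.

```python
import numpy as np

# Hypothetical values, for illustration only.
data = np.array([10.0, 40.0, 45.0, 50.0, 55.0, 60.0, 95.0])

# Box edges and interquartile range.
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25

# Farthest the whiskers are allowed to reach.
lower_limit = q25 - 1.5 * iqr
upper_limit = q75 + 1.5 * iqr

# Whiskers end at the most extreme data points inside those limits.
inside = data[(data >= lower_limit) & (data <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the limits would be drawn as an individual point.
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(iqr, lower_whisker, upper_whisker, outliers)
```

For these made-up values the box runs from 42.5 to 57.5, the whiskers end at 40 and 60, and the points 10 and 95 fall outside.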
All right, let's go have some fun computing percentiles and making box plots!
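As a starting point, here is a minimal sketch of the Seaborn call mentioned above. The DataFrame, its column names east_west and dem_share, and every value in it are hypothetical placeholders, not the lesson's data set.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tidy data: one row per state, values made up for illustration.
df = pd.DataFrame({
    "east_west": ["east", "east", "east", "west", "west", "west"],
    "dem_share": [48.0, 55.0, 61.0, 35.0, 42.0, 58.0],
})

# One box per category on the x axis, vote share on the y axis.
ax = sns.boxplot(x="east_west", y="dem_share", data=df)

# Label the axes.
ax.set_xlabel("region")
ax.set_ylabel("percent of vote for Obama")

plt.show()
```

Swapping sns.boxplot for sns.swarmplot with the same arguments would draw the bee swarm version of the same data.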