Get startedGet started for free

Plot all of your data: ECDFs

1. Plot all of your data: ECDFs

We saw in the last video the clarity of bee swarm plots.

2. 2008 US swing state election results

However, there is a limit to their efficacy. For example, imagine we wanted to plot the county-level voting data for all states

3. 2008 US election results: East and West

east of the Mississippi River and all states west. We make the swarm plot as before, but using a DataFrame that contains all states, with each classified as being east or west of the Mississippi. The bee swarm plot has a real problem. The edges have overlapping data points, which was necessary in order to fit all points onto the plot. We are now obfuscating data. So, using a bee swarm plot here is not the best option. As an alternative,

4. Empirical cumulative distribution function (ECDF)

we can compute an empirical cumulative distribution function, or ECDF. Again, this is best explained by example. Here is a picture of an ECDF of the percentage of swing state votes that went to Obama. A x-value of an ECDF is the quantity you are measuring, in this case the percent of vote that went to Obama. The y-value is the fraction of data points that have a value smaller than the corresponding x-value.

5. Empirical cumulative distribution function (ECDF)

For example, 20% of counties in swing states had 36% or less of its people vote for Obama.

6. Empirical cumulative distribution function (ECDF)

Similarly, 75% of counties in swing states had 50% or less of its people vote for Obama.

7. Making an ECDF

Let's look at how to make one of these from our data. The x-axis is the sorted data. We need to generate it using the NumPy function sort, so we need to import Numpy, which we do using the alias np as is commonly done. The we can use np dot sort to generate our x-data. The y-axis is evenly spaced data points with a maximum of one, which we can generate using the np dot arange function and then dividing by the total number of data points. Once we specify the x and y values, we plot the points. By default, plt-dot-plot plots lines connecting the data points. To plot our ECDF, we just want points. To achieve this we pass the string period and the string 'none' to the keywords arguments marker and linestyle, respectively. As you remember from my forceful reminder in an earlier video, we label the axes. Finally, we use the plt dot margins function to make sure none of the data points run over the side of the plot area. Choosing a value of point-02 gives a 2% buffer all around the plot.

8. 2008 US swing state election ECDF

The result is the beautiful ECDF I just showed you. We can also easily plot multiple ECDFs on the same plot.

9. 2008 US swing state election ECDFs

For example, here are the ECDFs for the three swing states. We see that Ohio and Pennsylvania were similar, with Pennsylvania having slightly more Democratic counties. Florida, on the other hand, had a greater fraction of heavily Republican counties. In my workflow, I almost always plot the ECDF first. It shows all the data and gives a complete picture of how the data are distributed.

10. Let's practice!

But don't take my word for how great ECDFs are. You can see for yourself in the exercises!