The kernel density estimator

1. The kernel density estimator

Now that we've seen where and how histograms can fail, what are our alternatives? There are many, but one of the most popular is the kernel density estimator, or KDE.

2. Where histograms struggle

Let's first go back over some of the situations where histograms struggle. They are sensitive to changes in bin width, which can cause trouble in multimodal situations such as datasets with digit preference. ... Additionally, selecting a proper bin width when you have a small amount of data is tricky. You either make the bins so wide that you're left with a blocky mess, or so narrow that your histogram looks like some abstract barcode, with lots of gaps that may just be the result of not having enough samples rather than true gaps in the distribution.

3. Kernel density plots

Whereas histograms work by setting thresholds and then counting the number of points that fall within each bin, KDEs work in the other direction. They start at each data point and place a 'kernel' function on top of it. These kernel functions are then summed to give the shape of the distribution. The kernel can have any shape. Sometimes it's a uniform distribution, which results in a plot that looks a lot like a histogram, but more frequently a Gaussian kernel is used. This means a little normal distribution is added for each point, and these are stacked on top of each other. When we refer to KDEs for the rest of this course, we will mean the Gaussian kernel version.
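To make the "stack little normal distributions" idea concrete, here is a minimal base R sketch. The five data values, the kernel standard deviation of 4, and the evaluation grid are all made up for illustration; they are not from the speeding data.

```r
# Hypothetical toy data: five observations on an arbitrary scale.
x <- c(12, 15, 16, 30, 41)
bw <- 4                          # kernel standard deviation (assumed value)
step <- 0.1
grid <- seq(-20, 80, by = step)  # points at which we evaluate the estimate

# One Gaussian bump per data point, evaluated across the grid.
# Rows are grid points, columns are data points.
bumps <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw))

# Sum the bumps and divide by the number of points so the area is 1.
kde <- rowSums(bumps) / length(x)

# Riemann-sum check that the estimate integrates to roughly 1.
area <- sum(kde) * step
```

Plotting kde against grid, for example with plot(grid, kde, type = "l"), shows the three nearby points merging into one tall peak while the two isolated points each keep their own bump.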

4. Making a KDE in ggplot

Unsurprisingly, making a KDE in ggplot is simple. You take the code you would have used with geom_histogram and replace geom_histogram with geom_density. ggplot will automatically deal with all the details, like adding up your kernels. Here we are looking at the percentage over the speed limit for a random sample of 100 observations from our speeding data, to simulate a small-data scenario. One change you will likely want to make to the default is setting a fill so your plot isn't just a line.
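A sketch of what that swap might look like. The course's speeding data isn't reproduced here, so the data frame speeding_sample and its column percent_over_limit are hypothetical stand-ins filled with simulated values.

```r
library(ggplot2)

# Simulated stand-in for the 100-observation sample; NOT the course data.
set.seed(42)
speeding_sample <- data.frame(
  percent_over_limit = rnorm(100, mean = 45, sd = 15)
)

# Same aesthetics you would give geom_histogram(), with the geom swapped.
# The fill colour is an arbitrary choice so the plot isn't just a line.
p <- ggplot(speeding_sample, aes(x = percent_over_limit)) +
  geom_density(fill = "steelblue", alpha = 0.6)

p
```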

5. Making a KDE in ggplot

The plot is nice and clear. We can see that there is a strong peak at a little less than 50% over the speed limit.

6. A new width to worry about

Since you are not binning any points, you don't have a bin width to worry about, but you do have to choose how wide you want the kernels to be. The wider the kernel you choose, that is, the larger its standard deviation, the smoother your plot will become. Go too wide and you can totally fill in parts of your x-axis that may have meaningful gaps or dips in them. A good best practice is to think about what a meaningful kernel width is in the context of your data and report the width with your plot.
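In geom_density, the kernel width is set with the bw argument, which for a numeric value is the standard deviation of the kernel in the units of your x-axis. Here is a sketch comparing a narrow and a wide kernel and reporting the width in a subtitle. The data are simulated stand-ins again, and the widths 1.5 and 10 are arbitrary picks for illustration.

```r
library(ggplot2)

# Simulated stand-in for the 100-observation sample; NOT the course data.
set.seed(42)
speeding_sample <- data.frame(
  percent_over_limit = rnorm(100, mean = 45, sd = 15)
)

# Narrow kernel: keeps small gaps and dips visible, but can look bumpy.
narrow <- ggplot(speeding_sample, aes(x = percent_over_limit)) +
  geom_density(bw = 1.5, fill = "steelblue", alpha = 0.6) +
  labs(subtitle = "Gaussian kernel, bw = 1.5 percentage points")

# Wide kernel: much smoother, but may paint over real gaps.
wide <- ggplot(speeding_sample, aes(x = percent_over_limit)) +
  geom_density(bw = 10, fill = "steelblue", alpha = 0.6) +
  labs(subtitle = "Gaussian kernel, bw = 10 percentage points")

narrow
wide
```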

7. Kernel size animation

Here's the same data we looked at for histogram bin widths, but this time we modify the kernel width instead of the bin width. We see that the plot does get clearer as the width decreases, but it eventually becomes unrealistically bumpy.

8. Show all the data

In addition to being thoughtful and transparent about your kernel widths, a great addition to KDEs is what is called a 'rug plot'. This is a series of dashes exactly where each data point falls on the x-axis, allowing the viewer to see where the KDE is filling gaps. Adding this is as easy as adding geom_rug to your code.
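A sketch of the density plot with a rug added, once more using simulated stand-in data rather than the course's speeding data. The alpha values are cosmetic choices.

```r
library(ggplot2)

# Simulated stand-in for the 100-observation sample; NOT the course data.
set.seed(42)
speeding_sample <- data.frame(
  percent_over_limit = rnorm(100, mean = 45, sd = 15)
)

# geom_rug() draws one tick per observation along the x-axis,
# so sparse stretches under the smooth curve become visible.
p <- ggplot(speeding_sample, aes(x = percent_over_limit)) +
  geom_density(fill = "steelblue", alpha = 0.6) +
  geom_rug(alpha = 0.5)

p
```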

9. Show all the data

We can now see that our plot is making assumptions about the density between around 80 and 90 percent, and we can decide (or let our reader decide) whether we agree.

10. Let's stack some gaussians!

Now that you're well versed in KDEs, let's make some!
