1. Histogram nuances
Now that you're familiar with the basics of making histograms in ggplot and with our new data, we can move on to the nuances of histograms that most people ignore.
2. Histogram positives
Before we get into the details of improving your histograms, let's briefly take a step back and look at what's good about them and what's bad.
Starting with the good: they are intuitive to the reader (notice a theme in this course?). Density is something that most people understand: the higher the bar, the more frequent, or probable, a value in that range is.
...
They are also interpretable. In their default form, the y-axis level of a bar corresponds to how many points in your data fell within its x-bin range. This is nice and easy to understand.
3. Histogram negatives
Now to the bad. You may have noticed in the previous exercises that ggplot bugged you about the bins when you plotted with plain geom_histogram. The default is 30, but ggplot intelligently asks you to fiddle with this. This is because, as we will see, how many bins you have and where their boundaries fall can make a big difference in the plot you get.
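To make that concrete, here is a minimal sketch of the three ways you might call geom_histogram. The data frame df and its column x are made up for illustration; they are not the course data. The first plot relies on the default of 30 bins and triggers ggplot's message, while the other two set the number of bins, or the bin width and a boundary, explicitly.

```r
library(ggplot2)

# Hypothetical example data (an assumption, not the course data):
# 200 draws from a standard normal
set.seed(42)
df <- data.frame(x = rnorm(200))

# Default: ggplot falls back to 30 bins and prints a message
# suggesting you pick a better value
p_default <- ggplot(df, aes(x = x)) +
  geom_histogram()

# Explicit number of bins
p_bins <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 20)

# Explicit bin width, with bin edges aligned so that one falls on 0
p_width <- ggplot(df, aes(x = x)) +
  geom_histogram(binwidth = 0.5, boundary = 0)
```

Printing p_default, p_bins, and p_width in turn shows how the same 200 points can look quite different depending on these choices.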
...
This is especially a problem when combined with the second con: when you don't have much data, there tend to be gaps in the histogram. These gaps could be interpreted as the data not falling in a given range, but they could also just be due to a small sample size. I call these barcode plots, and they're not very helpful.
4. Adjusting number of bins
To see why ggplot is bugging you about adjusting your bins, let's take a look at an example. In this animation, we are scaling the bin number from 10 to 55 while maintaining the exact same underlying data.
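The animation itself isn't reproduced here, but a rough sketch of the same idea is to draw one histogram per bin count from identical data. The data below are a made-up stand-in (an assumption, not the data behind the animation), and the four bin counts are just a few stops between 10 and 55.

```r
library(ggplot2)

# Hypothetical stand-in data: two overlapping groups, 300 points in total
set.seed(1)
df <- data.frame(
  x = c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 3, sd = 0.5))
)

# A few stops along the 10-to-55 range used in the animation
bin_counts <- c(10, 25, 40, 55)

# One plot ("frame") per bin count; the data never change
frames <- lapply(bin_counts, function(b) {
  ggplot(df, aes(x = x)) +
    geom_histogram(bins = b) +
    ggtitle(paste(b, "bins"))
})

# Show each frame in turn
for (frame in frames) print(frame)
```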
Out of context, you may look at any one of these frames and make some inference about the shape of the distribution, which would probably be incorrect. So how do we avoid problems like the ones we see in this animation?
5. Bin number best practices
There are plenty of efforts to fix this issue, but ultimately they make assumptions about the underlying data distribution. Given that we are visualizing our data because we aren't sure of its underlying distribution, relying on those assumptions can be ill-advised.
A good rule of thumb is: if you have more than around 150 points, set your number of bins to 100 and call it a day. Most likely this will make your grid fine enough that adding any more bins won't gain you much advantage due to diminishing returns.
For smaller numbers of observations, playing around with the number of bins to get a sense of the patterns, and then choosing the best value for your final plot (or showing multiple), is generally the best path.
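As a small sketch, the rule of thumb above can be written as a helper. The names choose_bins and bins_if_small are hypothetical, introduced only for this example; the 150 and 100 come straight from the rule. For small data the helper deliberately does not pick a value for you: it just passes through whatever you settled on after exploring.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot): more than ~150 points gets
# 100 bins; otherwise use the value you chose by hand after exploring
choose_bins <- function(n_obs, bins_if_small) {
  if (n_obs > 150) 100 else bins_if_small
}

# Made-up example data, one large and one small sample
set.seed(7)
big   <- data.frame(x = rnorm(1000))
small <- data.frame(x = rnorm(60))

p_big <- ggplot(big, aes(x = x)) +
  geom_histogram(bins = choose_bins(nrow(big), bins_if_small = 12))

p_small <- ggplot(small, aes(x = x)) +
  geom_histogram(bins = choose_bins(nrow(small), bins_if_small = 12))
```

The 12 used for the small sample is only a placeholder; in practice you would try several values before settling on one.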
6. Reality
To be fair to the histogram, the example shown in the animation is a contrived one. The data are simulated as a mixture of 20 small-variance normal distributions. However, in real life, data like these do exist in the form of digit preference. Blood pressure, for instance, often has large peaks at the round numbers just because people _like_ round numbers.
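If you want to experiment with data of this kind yourself, here is one way a mixture of 20 small-variance normals could be simulated. The exact settings behind the animation aren't given, so the centers, standard deviation, and sample size below are all assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(2024)

n_components <- 20
n_obs <- 2000

# Assumed: evenly spaced centers and a small, shared standard deviation
centers <- seq(5, 100, length.out = n_components)
component_sd <- 0.5

# Pick a component for each observation, then draw from that component
component <- sample(n_components, n_obs, replace = TRUE)
sim <- data.frame(
  x = rnorm(n_obs, mean = centers[component], sd = component_sd)
)

# With enough bins the 20 spikes show up; with few bins they blur together
p_sim <- ggplot(sim, aes(x = x)) +
  geom_histogram(bins = 100)
```

Changing bins in the last call is a quick way to see how easily the spikes appear and disappear.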
...
If your data are less likely to exhibit this type of pattern, as is often the case in automatically captured data, your histogram most likely won't be misleading. However, it is still important to set the number of bins carefully.
7. Let's improve some histograms!
Okay, now that you're up to speed on the ins and outs of histograms, let's step away from the defaults and customize some plots.