1. Making a histogram
Previously, we learned how to make line plots, scatter plots, and bar charts. In this lesson, we'll learn how to make one last type of plot: a histogram.
2. Tracking down the kidnapper
Freddie Frequentist left a dirty shoe print at the scene of the kidnapping. In that shoe print, there were many tiny pieces of gravel. Our crime lab was able to give us a pandas DataFrame with the radius of each piece of gravel. We want to compare the distribution of gravel radii to samples from the three sites where Freddie could be hiding Bayes. If the distribution from the shoe print matches the distribution from one of the sites, then the gravel in the footprint likely came from that site, and Freddie is likely hiding there.
3. What is a histogram?
The best way of visualizing the distributions of gravel radii is by creating a histogram. A histogram visualizes the distribution of values in a dataset.
To create a histogram, we place each piece of data into a bin. In this example, we have divided our data into four bins: 1 - 10 mm, 11 - 20 mm, 21 - 30 mm, and 31 - 40 mm. The histogram tells us how many pieces of data (or pieces of gravel) fall into each of those bins. When we look at a histogram, we can quickly understand the entire dataset. Plotting a histogram in matplotlib is easy.
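To make the binning step concrete, here is a minimal sketch that counts values per bin with NumPy before any plotting. The radii are made-up values for illustration, and the bin edges 0, 10, 20, 30, 40 only approximate the four bins named above (NumPy treats each bin as including its left edge and excluding its right edge, except for the last bin).

```python
import numpy as np

# Hypothetical gravel radii in mm (made-up values, not from the crime lab)
radii = np.array([3, 7, 12, 15, 18, 22, 25, 33, 38, 9])

# Bin edges approximating the four bins described in the lesson
edges = [0, 10, 20, 30, 40]

# np.histogram returns the number of values in each bin and the edges used
counts, used_edges = np.histogram(radii, bins=edges)

for left, right, n in zip(used_edges[:-1], used_edges[1:], counts):
    print(f"{left}-{right} mm: {n} pieces")
```

Every piece of gravel lands in exactly one bin, so the counts add up to the number of samples.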
4. Histograms with matplotlib
We simply use the command plt-dot-hist. This function takes just one positional argument: our dataset. The output is shown on the right.
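As a rough sketch of that call, the example below passes a dataset to plt.hist as its only positional argument. The data here is randomly generated as a stand-in for the gravel radii, which is an assumption; in the lesson the values would come from a column of the DataFrame.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: 500 randomly generated radii in mm (an assumption,
# not the real shoe print measurements)
rng = np.random.default_rng(0)
radii = rng.normal(loc=60, scale=15, size=500)

# The dataset is the only positional argument.
# plt.hist also returns the bar heights and the bin edges it computed.
counts, bin_edges, patches = plt.hist(radii)

plt.xlabel("Gravel radius (mm)")
plt.ylabel("Number of pieces")
# In an interactive session, plt.show() would display the figure.
```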
By default, matplotlib will create a histogram with 10 bins of equal size spanning from the smallest sample to the largest sample in our dataset. If we want to change the number of bins,
5. Changing bins
we can use the keyword argument "bins". Here, we pass bins a single integer. In this case, we divide our histogram into 40 bins. This allows us to see more detail in our histogram. If we want to zoom in on a specific piece of our dataset,
6. Changing range
we can use the keyword "range" to set the minimum and maximum value for our histogram.
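A short sketch combining the two keywords: bins=40 asks for 40 bars, and range=(50, 100) restricts the histogram to values between 50 and 100. The data is again randomly generated as a stand-in, which is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: randomly generated radii in mm (an assumption)
rng = np.random.default_rng(0)
radii = rng.normal(loc=60, scale=15, size=500)

# bins takes the number of bins; range takes a (minimum, maximum) tuple.
# Samples outside the range are left out of the histogram.
counts, bin_edges, patches = plt.hist(radii, bins=40, range=(50, 100))

print("number of bins:", len(counts))
print("first and last edge:", bin_edges[0], bin_edges[-1])
```

Because samples outside the range are ignored, the counts can add up to fewer than the total number of samples.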
Note that we give the minimum and maximum values inside of a second set of parentheses, separated by a comma. In this case, the minimum value is 50 and the maximum value is 100. Suppose we wanted to compare the distributions of weights
7. Normalizing
of male and female puppies. For some reason, we were able to collect many more samples of male puppy weights than female puppy weights. When we plot both histograms on the same axes, we can't actually see the difference in the distributions.
In this case, we don't actually care about the absolute number of male puppies with a given weight. Instead, we care about what proportion of the dataset has that weight.
We can solve this problem with normalization. Normalization rescales the height of each bar by a constant factor so that the areas of all the bars sum to one. This makes our two histograms comparable, even if the sample sizes are different.
We can normalize our histogram by using the keyword argument density equals True.
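The sketch below illustrates this with two made-up samples of very different sizes. The puppy weights are randomly generated and the shared bin edges are an illustrative choice; neither comes from the lesson's data. It also checks numerically that the bar areas of each normalized histogram sum to one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up puppy weights: many more male samples than female samples
rng = np.random.default_rng(1)
male = rng.normal(loc=12, scale=2, size=2000)
female = rng.normal(loc=10, scale=2, size=200)

# Shared bin edges (an illustrative choice) so the bars line up
edges = np.linspace(0, 24, 25)

# density=True rescales each histogram so that its total bar area is one
m_density, _, _ = plt.hist(male, bins=edges, density=True, alpha=0.5, label="male")
f_density, _, _ = plt.hist(female, bins=edges, density=True, alpha=0.5, label="female")
plt.legend()

# Area of each bar = height * width; the areas should sum to one
bin_widths = np.diff(edges)
male_area = (m_density * bin_widths).sum()
female_area = (f_density * bin_widths).sum()
print(male_area, female_area)
```

Using the same bin edges for both samples keeps the bin widths equal, so bar heights can be compared directly between the two histograms.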
Now the area of each bar represents a proportion of the entire dataset. If a bar from the male puppies has the same height and width as a bar from the female puppies, both bars represent the same proportion of each population. Now that you've learned how
8. Let's practice!
to create histograms, let's practice by identifying where Freddie Frequentist is hiding!