Importance of distributions

1. Distributions: part one

In the next two chapters we're going to be delving into an extremely important topic: visualizing distributions.

2. What is distribution data?

As a data scientist, you will most likely deal with distribution data frequently. Anytime you can take multiple readings of something and get different results, you're dealing with data that has some underlying distribution. This could be the act of measuring user engagement with a product. The result may be 1 second one time, 5 seconds another, and so on. Whether these differences come from measurement error or just inherent differences in the subjects, getting a good grasp on the underlying distribution driving the observations is invaluable to your data science workflow.

3. Why distributions are important

There are many reasons why visualizing a data distribution is important, but three that stick out to me as a data scientist are as follows: One) You can catch errors in your data. Maybe you accidentally only included observations greater than one; a simple mean or median summary may not reveal this. ... Two) Perhaps you see two strong peaks in your data. This may be an indicator that, if you are modeling, you should look for a controlling variable that is creating these separate peaks. ... Three) Showing the distribution of the data is simply more accurate and truthful to the data than compressing it into a summary statistic or two. If you have the space, you should do it.
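To make the second point concrete, here is a minimal sketch in base R. The two samples are simulated purely for illustration (they are not from the course data): one has a single peak and the other has two, yet their means and medians land close together.

```r
set.seed(42)  # make the simulated values reproducible

# Hypothetical readings: one single-peaked sample, one with two peaks
one_peak  <- rnorm(1000, mean = 50, sd = 10)
two_peaks <- c(rnorm(500, mean = 35, sd = 4), rnorm(500, mean = 65, sd = 4))

# The summary statistics of the two samples are close to each other
c(mean = mean(one_peak),  median = median(one_peak))
c(mean = mean(two_peaks), median = median(two_peaks))

# The histograms are not: the second one shows two separate peaks
hist(one_peak,  breaks = 30, main = "One peak")
hist(two_peaks, breaks = 30, main = "Two peaks")
```

Looking only at the printed summaries, you would have little reason to suspect that the second sample is a mix of two groups.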

4. Standard plots

Traditionally, when looking at distribution data, people use two main types of visualization, each motivated by a different scenario. The first visualization is the histogram. The histogram shows the data's density by dividing the data range into bins and drawing a bar over each bin, with the bar's height corresponding to the number of observations in that bin. Histograms are typically used when the goal is to investigate the shape of a single distribution. In this chapter, we will look at histograms and their alternatives. ... The second visualization is the boxplot, and its scenario is comparing multiple distributions. Boxplots draw a box with lines at the 25th percentile, median, and 75th percentile, along with whiskers extending up to 1.5 times the interquartile range beyond the box and dots to visualize outliers past the whiskers. We will talk about boxplots and their alternatives in the next chapter.
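The numbers behind a boxplot can be sketched in a few lines of base R. The speeds below are made up for illustration, and the quartiles use R's default `quantile()` method; plotting functions may compute the box edges slightly differently.

```r
# Hypothetical speeds in miles per hour, made up for illustration
speeds <- c(38, 41, 44, 45, 47, 49, 50, 52, 55, 58, 61, 104)

# Box edges (25th and 75th percentiles) and the median line
q   <- quantile(speeds, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]  # interquartile range

# Whiskers reach the most extreme observations inside these fences
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Anything beyond the fences is drawn as an individual dot
outliers <- speeds[speeds < lower_fence | speeds > upper_fence]
outliers
```

Here only the driver going 104 falls outside the fences, so that is the one point a boxplot would draw as a dot.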

5. Maryland speeding data

For this chapter we're going to be working with a new dataset. For each exercise you complete, the data frame md_speeding will be loaded. This data comes from Montgomery County, Maryland and contains all the traffic tickets given for speeding in 2017. Along with the speed limit and driver's speed, we have info on the driver's race, gender and home state, along with the color, type, and model year of the car they were driving.

6. Making a histogram in ggplot2

To demo the data and also the type of plots you will be making, let's put together a rather silly histogram of the speed that the blue cars were going when pulled over. You may have a blue car and want to know how fast you should go to be normal. Like many things in ggplot, this is rather easy. Since a histogram only needs a vector of continuous values, all we need to do is map the x aesthetic to the column of our data frame we want to histogramize. In this case: speed.
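A minimal sketch of what this could look like follows. The real md_speeding data frame is only available in the exercises, so the block builds a tiny stand-in with made-up values. The column name `vehicle_color` is an assumption; only `speed` is named in the lesson.

```r
library(ggplot2)

# Small stand-in for md_speeding: the values and the column name
# `vehicle_color` are assumptions, only `speed` is named in the lesson
md_speeding <- data.frame(
  vehicle_color = c("BLUE", "RED", "BLUE", "BLUE", "BLACK", "BLUE", "BLUE"),
  speed         = c(48, 61, 52, 55, 70, 45, 66)
)

# Keep only the blue cars
blue_cars <- subset(md_speeding, vehicle_color == "BLUE")

# A histogram needs only an x aesthetic; ggplot2 counts the
# observations that fall in each bin for us
p <- ggplot(blue_cars, aes(x = speed)) +
  geom_histogram(bins = 10)
p
```

The `bins` argument is set explicitly here because the toy data is so small; with the full dataset you would likely experiment with the bin count or width.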

7. Making a histogram

Here we can see that our data appear to have a slight skew to the right with the highest peak being around 50 miles per hour, with a few particularly scary drivers going more than 100 miles an hour.

8. Let's make some histograms!

Let's get familiar with our new data by making some histograms in the exercises.
