Distribution of one variable

1. Distribution of one variable

If you're interested in the distribution of just a single numerical variable, there are three ways you can get there. The first is to look at the marginal distribution of that variable across the whole dataset.

2. Marginal vs. conditional

An example is the simple distribution of highway mileage. The second is to look at the distribution conditional on another variable, that is, within subsets of the data.

3. Marginal vs. conditional

Say we want to compare cars that are pickup trucks with those that aren't: we can add a facet wrap layer to see the distribution for both pickups and non-pickups.
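As a rough sketch of the marginal and faceted views, here is one way they might look in code. The cars dataset isn't shown in this transcript, so this assumes ggplot2's built-in mpg data as a stand-in: hwy is its highway mileage column, and the pickup flag is a hypothetical column derived from its class column.

```r
library(ggplot2)

# Stand-in data (an assumption): ggplot2's built-in mpg dataset.
# hwy = highway miles per gallon, class = type of vehicle.
mpg_flagged <- mpg
mpg_flagged$pickup <- mpg_flagged$class == "pickup"  # hypothetical TRUE/FALSE flag

# Marginal distribution: highway mileage across all cars.
p_marginal <- ggplot(mpg_flagged, aes(x = hwy)) +
  geom_histogram()

# Conditional distribution: one panel for pickups, one for non-pickups.
p_faceted <- ggplot(mpg_flagged, aes(x = hwy)) +
  geom_histogram() +
  facet_wrap(~pickup)

print(p_marginal)
print(p_faceted)
```

The only difference between the two plots is the facet_wrap() layer, which splits the same histogram into one panel per value of the flag.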

4. Building a data pipeline

There's another scenario, though, in which we'd want to look at the distribution of this variable on a more specific subset of the data, say the cars that have engines smaller than 2.0 liters. Since engine size is numerical, it won't work to simply use facets, which would create a separate panel for every distinct engine size. Instead, we have to filter. filter() is a function in the dplyr package that keeps only the rows that meet a particular condition. In this case, we want the rows where the engine size variable is less than 2.0. Notice that we're using the pipe operator, which takes the output of whatever is before it and pipes it as input into the next function. We then save the filtered data into a new dataset called cars2. The second step is to construct the plot using this new dataset. This construction is a bit inefficient, though, since we save an intermediate dataset, cars2, that we're not really interested in.
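The two-step version described here could be sketched as follows. Again this assumes ggplot2's mpg data as a stand-in for the cars dataset, with displ (engine displacement in liters) playing the role of the engine size variable and hwy the highway mileage; only the name cars2 comes from the transcript.

```r
library(dplyr)
library(ggplot2)

# Stand-in data (an assumption): mpg, with displ = engine size in liters.

# Step 1: keep only the rows with engines smaller than 2.0 liters
# and save them as an intermediate dataset.
cars2 <- mpg %>%
  filter(displ < 2.0)

# Step 2: build the histogram from the filtered dataset.
p <- ggplot(cars2, aes(x = hwy)) +
  geom_histogram()

print(p)
```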

5. Building a data pipeline

We can solve this by linking these two components into a continuous data pipeline. We start with the raw data, which we pipe into the filter, the result of which gets piped into the ggplot function, to which we then add a layer to complete the plot. Note that layers are added to a ggplot with the plus sign, not with the pipe. This is a powerful and very general paradigm: you can start with a raw dataset, process that dataset using dplyr functions linked by pipes, then visualize it by adding layers to a ggplot.
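A minimal sketch of that pipeline, under the same assumptions as before (ggplot2's mpg data standing in for the cars dataset, with displ and hwy as the engine size and highway mileage columns):

```r
library(dplyr)
library(ggplot2)

# Raw data -> filter -> ggplot, with no intermediate dataset saved.
# The dplyr steps are linked with the pipe (%>%);
# the ggplot layers are linked with +.
p_small <- mpg %>%
  filter(displ < 2.0) %>%
  ggplot(aes(x = hwy)) +
  geom_histogram()

print(p_small)
```

Because the filtered data is passed straight into ggplot(), the data argument is left out of that call and only the aesthetics are given.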

6. Filtered and faceted histogram

Let's run that code in the console. The resulting plot makes some sense. These are cars with small engines, and small engines are usually more efficient, so we're seeing higher mileages than when we looked at the whole dataset. One thing that's important to know about histograms like this one is that your sense of the shape of the distribution can change depending on the bin width that is selected.

7. Wide bin width

ggplot2 does its best to select a sensible bin width, but you can override that option by specifying it yourself. If we use a binwidth of 5, the result is a histogram that's much smoother. The same principle holds for density plots.
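One way this override might look, sketched with ggplot2's mpg data as an assumed stand-in for the cars dataset (displ for engine size, hwy for highway mileage). The binwidth argument of geom_histogram() is in the units of the x variable, so here each bar spans 5 miles per gallon.

```r
library(dplyr)
library(ggplot2)

bin_width <- 5  # width of each bin, in miles per gallon

# Same pipeline as before, but with the bin width set by hand
# instead of left to the default.
p_wide <- mpg %>%
  filter(displ < 2.0) %>%
  ggplot(aes(x = hwy)) +
  geom_histogram(binwidth = bin_width)

print(p_wide)
```

Trying a few values of bin_width, both smaller and larger, is a quick way to see how much the apparent shape depends on this choice.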

8. Density plot

Let's pull up a density plot for the same data. It looks reasonably smooth, but if we want to make it smoother, we can increase what's known as the bandwidth of the plot.

9. Wide bandwidth

When we increase that to 5, we get a plot that smooths over the blip on the right side a bit more. But how do we decide what the "best" binwidth or bandwidth is for our plots? Usually the defaults are sensible, but it's good practice to tinker with both smoother and less-smooth versions of the plots to focus on different scales of structure in the data.
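The density plots could be sketched along these lines, again assuming ggplot2's mpg data as a stand-in for the cars dataset. The bw argument of geom_density() sets the bandwidth of the kernel density estimate; leaving it out uses the default bandwidth rule.

```r
library(dplyr)
library(ggplot2)

# Stand-in data (an assumption): small-engine cars from mpg.
small_engines <- mpg %>%
  filter(displ < 2.0)

# Density plot with the default bandwidth.
p_default <- ggplot(small_engines, aes(x = hwy)) +
  geom_density()

# Density plot with a wider bandwidth, which gives a smoother curve.
p_smooth <- ggplot(small_engines, aes(x = hwy)) +
  geom_density(bw = 5)

print(p_default)
print(p_smooth)
```

Printing the two plots one after the other makes it easy to compare how much detail each bandwidth keeps.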

10. Let's practice!

Let's try putting these techniques into practice.