Exploring numerical data

1. Exploring numerical data

In this chapter, we'll broaden our toolbox of exploratory techniques to encompass numerical data. Numerical data are data that take the form of numbers, where those numbers actually represent a value on the number line. The dataset that we'll be working with has information on the cars that were for sale in the US in a certain year.

2. Cars dataset

We can use the structure function, s-t-r, to learn more about each of the variables. We learn that we have 428 observations, or cases, and 19 variables. Unlike most displays of data, the structure function puts each of the variables as a row, with its name followed by its data type, followed by the first several values. The car names are character strings, which are like factors, except it's common for every case to take a unique value. L-o-g-i stands for logical variables, another simple case of a categorical variable where there are only two levels. For example, each car will take either TRUE or FALSE depending on whether it is a sports car. We can see that the last set of variables are all either i-n-t for integer or n-u-m for numerical. They're actually both numerical variables, but the integers are discrete and the numerics are continuous. If you look at ncyl, the number of cylinders, it's listed as an integer, but there are only a few different values that it can take, so it actually behaves a bit like a categorical variable. Let's construct some plots to help us explore this data.
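To make the "one variable per row" layout concrete, here is a minimal sketch in Python (the video itself works in R). The tiny `cars` table and the helper `describe_structure` are hypothetical stand-ins invented for illustration; they are not the real dataset or the real structure function.

```python
# Hypothetical miniature stand-in for the cars data; all values are invented.
cars = {
    "name": ["Car A", "Car B", "Car C", "Car D"],
    "sports_car": [False, True, False, False],
    "ncyl": [4, 6, 4, 8],
    "hwy_mpg": [34.0, 26.0, 31.5, 19.0],
}


def describe_structure(columns, preview=3):
    """Return one line per variable: its name, its type, and its first few values."""
    n_obs = len(next(iter(columns.values())))
    lines = [f"{n_obs} obs. of {len(columns)} variables:"]
    for name, values in columns.items():
        type_name = type(values[0]).__name__
        shown = " ".join(str(v) for v in values[:preview])
        lines.append(f" $ {name}: {type_name} {shown} ...")
    return lines


structure_lines = describe_structure(cars)
print("\n".join(structure_lines))

# An integer column with only a few distinct values behaves much like a category.
distinct_ncyl = sorted(set(cars["ncyl"]))
print("distinct ncyl values:", distinct_ncyl)
```

Counting the distinct values of a column, as in the last two lines, is one quick way to spot an integer variable that acts like a categorical one.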

3. Dotplot

The most direct way to represent numerical data is with a dot plot, where each case is a dot that's placed at its appropriate value on the x axis, then stacked as other cases take similar values. This is a form of graphic where there is zero information loss; you could actually rebuild the dataset perfectly if you were given this plot. As you can imagine, though, these plots start to get difficult to read as the number of cases gets very large.
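The "zero information loss" idea can be sketched in Python (the video itself works in R). The mileage values below are invented, the helpers `dot_plot` and `rebuild` are hypothetical, and the sketch assumes whole-number values so that the axis labels can be read back exactly.

```python
from collections import Counter

# Invented highway-mileage values, for illustration only.
mileage = [20, 22, 22, 25, 25, 25, 31]


def dot_plot(values):
    """Return text rows with one '*' per case stacked above its value."""
    counts = Counter(values)
    xs = sorted(counts)
    width = max(len(str(x)) for x in xs)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        marks = ("*" if counts[x] >= level else " " for x in xs)
        rows.append(" ".join(m.ljust(width) for m in marks).rstrip())
    rows.append(" ".join(str(x).ljust(width) for x in xs))  # the x axis
    return rows


def rebuild(rows):
    """Recover the sorted values by counting the dots stacked above each axis label."""
    axis = rows[-1].split()
    width = max(len(label) for label in axis)
    values = []
    for i, label in enumerate(axis):
        col = i * (width + 1)  # column where this label's dots are drawn
        height = sum(1 for row in rows[:-1] if len(row) > col and row[col] == "*")
        values.extend([int(label)] * height)
    return values


rows = dot_plot(mileage)
print("\n".join(rows))
print(rebuild(rows) == sorted(mileage))
```

The rebuilt list matches the original values for this one variable, up to the order of the rows.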

4. Histogram

One of the most common plots to use is a histogram, which solves this problem by aggregating the dots into bins on the x axis, then mapping the height of the bar to the number of cases that fall into that bin. Because of the binning, it's not possible to perfectly reconstruct the dataset: what we gain is a bigger picture of the shape of the distribution. If the stepwise nature of the histogram irks you, then you'll like the density plot.
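As a rough illustration of the binning step, here is a minimal Python sketch (the video itself works in R). The two small datasets are invented, `histogram_counts` is a hypothetical helper, and treating each bin as closed on the left is an assumption made for this sketch rather than something stated in the video.

```python
import math


def histogram_counts(values, binwidth):
    """Count how many values fall in each bin [left, left + binwidth)."""
    start = math.floor(min(values) / binwidth) * binwidth
    n_bins = int((max(values) - start) // binwidth) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // binwidth)] += 1
    # Pair each bin's left edge with its bar height.
    return [(start + i * binwidth, c) for i, c in enumerate(counts)]


# Two different invented datasets...
mileage_a = [20, 22, 22, 25, 25, 25, 31]
mileage_b = [21, 21, 24, 26, 27, 29, 34]

bars_a = histogram_counts(mileage_a, binwidth=5)
bars_b = histogram_counts(mileage_b, binwidth=5)

# ...that produce exactly the same bars, so the plot cannot tell them apart.
print(bars_a)
print(bars_a == bars_b)
```

Because both datasets collapse to the same bar heights, the original values cannot be recovered from the histogram alone, which is the information loss described above.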

5. Density plot

The density plot represents the shape of the histogram using a smooth line. This provides an even bigger-picture representation of the shape of the distribution, so you'll only want to use it when you have a large number of cases. If you'd prefer a more abstracted sense of this distribution, we could identify the center of the distribution,

6. Density plot

the values that mark off the middle half of the data,

7. Density plot

and the values that mark off the vast majority of the data.
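Stepping back to the smooth line of the density plot for a moment: one common way to draw such a line is a kernel density estimate, sketched below in Python (the video itself works in R). The video does not say which smoothing method its plot uses, so the Gaussian kernel, the bandwidth of 2.0, the helper `gaussian_kde`, and the mileage values are all assumptions made for illustration.

```python
import math

# Invented highway-mileage values, for illustration only.
mileage = [20, 22, 22, 25, 25, 25, 31]


def gaussian_kde(values, bandwidth):
    """Return a function that averages a Gaussian bump centred on each value."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


density = gaussian_kde(mileage, bandwidth=2.0)

# Evaluate the curve on a grid from 0 to 50 and check that it encloses an area near 1.
step = 0.1
grid = [i * step for i in range(501)]
area = sum(density(x) for x in grid) * step

print(round(area, 3))
print(density(25) > density(31))  # the curve is higher where the cases pile up
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one follows the individual cases more closely.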

8. Boxplot

These values can be used to construct a boxplot,

9. Boxplot

where the box represents the central bulk of the data,

10. Boxplot

the whiskers contain almost all the data,

11. Boxplot

and the extreme values are represented as points. You'll see the syntax for this is a bit different: we'll discuss why later on in the chapter.
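The pieces of a boxplot can be computed directly, as in this minimal Python sketch (the video itself works in R). The video does not spell out how far the whiskers reach, so the rule used here, 1.5 times the interquartile range beyond the box, is an assumed common convention. The helper `boxplot_stats` and the mileage values are hypothetical.

```python
import statistics

# Invented highway-mileage values with one extreme case, for illustration only.
mileage = [18, 20, 22, 22, 25, 25, 25, 27, 31, 60]


def boxplot_stats(values, whisker=1.5):
    """Compute the box, the whisker ends, and the points drawn individually."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,                     # lower edge of the box
        "median": median,             # line inside the box
        "q3": q3,                     # upper edge of the box
        "whisker_low": inside[0],     # most extreme values still within the fences
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = boxplot_stats(mileage)
for key, value in stats.items():
    print(key, value)
```

The box spans the middle half of the data, the whiskers stop at the last cases inside the fences, and anything beyond them is listed separately as an extreme value.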

12. Faceted histogram

Let's use a histogram to look at the distribution of highway mileage, faceted by whether or not the car is a pickup truck, by adding a facet wrap layer. It gives us a message, letting us know that it has picked a binwidth for us, and a warning that there were 14 missing values. The plot that it provides is informative: it's clear that there are many more non-pickups than pickups.

13. Faceted histogram

It also shows that the typical pickup gets much lower mileage than the typical non-pickup.

14. Faceted histogram

We also see that non-pickups have more variability than do the pickups.
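The three comparisons read off the faceted plot (how many cases, a typical value, and the variability in each panel) can also be computed per group, as in this minimal Python sketch (the video itself works in R). The rows below are invented and were chosen to mirror the pattern described, so they are not evidence for it. The helper `facet_summary` is hypothetical, and the median and standard deviation are just one possible choice of "typical" and "variability".

```python
import statistics
from collections import defaultdict

# Invented (is_pickup, highway_mpg) rows; None marks a missing mileage.
cars = [
    (False, 34), (False, 26), (False, 31), (False, 19), (False, None),
    (False, 28), (True, 21), (True, 18), (True, None), (True, 20),
]


def facet_summary(rows):
    """Split mileage by group, drop missing values, and summarise each group."""
    groups = defaultdict(list)
    n_missing = 0
    for is_pickup, mpg in rows:
        if mpg is None:
            n_missing += 1  # counted, then left out of the panels
            continue
        groups[is_pickup].append(mpg)
    summary = {
        key: {
            "n": len(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values),
        }
        for key, values in groups.items()
    }
    return summary, n_missing


summary, n_missing = facet_summary(cars)
print("missing values dropped:", n_missing)
for is_pickup, stats in summary.items():
    print("pickup" if is_pickup else "non-pickup", stats)
```

Each entry in `summary` corresponds to one panel of the faceted histogram, reduced to a count, a typical value, and a measure of spread.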

15. Let's practice!

Keep an eye on these two components, a typical observation and the variability of a distribution, as you practice exploring this numerical data.