Visualization in higher dimensions

1. Visualization in higher dimensions

In this course, we've been encouraging you to think about questions like "what is the association between this variable and that one" and "if you condition on one level of this variable, how does the distribution of another change". Answering these questions requires multivariate thinking, which is an essential skill in reasoning about the structure of real data. But why stop at only two variables?

2. Plots for 3 variables

One simple extension that allows you to plot the association between three variables is the facet grid. Let's build a plot that can tell us about msrp, the manufacturer's suggested retail price. Since that variable is numerical, there are several plots we could use. Let's go with a density plot. By adding a facet grid layer, we can break that distribution down by two categorical variables, separated by a tilde. Whichever variable you put before the tilde will go in the rows of the grid and the one that goes after will form the columns. When we run this code, we get a grid of four density plots, one for every combination of levels of the two categorical variables. Unfortunately, this plot is difficult to interpret since it doesn't remind us which variable is on the rows and which is on the columns. We can solve this by adding an option to the facet grid layer:

3. Plots for 3 variables

labeller is equal to label_both. OK, now we can learn something. If we look at rear-wheel drive pickups, there actually appear to be two modes, but in general, they're a bit cheaper than front-wheel drive pickups. In non-pickups, however, it's the rear-wheel drive ones that are generally pricier.
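The facet grid and labeller option described above can be sketched in ggplot2 roughly as follows. This is a minimal illustration, not the course's exact code: cars_demo is a made-up stand-in for the course's cars data, and the column names pickup and rear_wheel are assumed from the narration.

```r
library(ggplot2)

# Made-up stand-in for the course's cars data (values are illustrative only).
set.seed(1)
cars_demo <- data.frame(
  msrp = round(runif(40, min = 15000, max = 60000)),
  pickup = rep(c(FALSE, TRUE), times = c(30, 10)),
  rear_wheel = c(rep(FALSE, 26), rep(TRUE, 4), rep(FALSE, 7), rep(TRUE, 3))
)

# pickup (before the tilde) defines the rows; rear_wheel (after) the columns.
p_plain <- ggplot(cars_demo, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel)

# label_both prints "variable: value" on each strip, so the rows and
# columns identify themselves.
p_labelled <- ggplot(cars_demo, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel, labeller = label_both)

p_labelled
```

Printing p_plain instead shows the same four panels with strips that read only TRUE or FALSE, which is the ambiguity the labeller option removes.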

4. Plots for 3 variables

One thing we should check before moving on is the number of cases that go into each of these 4 plots. If we form a contingency table of rear wheel and pickup, we learn that there are relatively few rear-wheel drive cars in this dataset. While this would be plainly obvious had we used histograms, where bar heights reflect counts, density plots normalize each distribution so that every panel has the same total area. The take-home message is that our interpretation is still valid, but when we're making comparisons across the rear wheel variable, there are fewer cases to compare.
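A contingency table like the one mentioned here could be formed with base R's table(). The sketch below again uses a made-up data frame, cars_demo, with assumed column names, so the counts say nothing about the real dataset.

```r
# Made-up stand-in for the course's cars data (values are illustrative only).
cars_demo <- data.frame(
  pickup = rep(c(FALSE, TRUE), times = c(30, 10)),
  rear_wheel = c(rep(FALSE, 26), rep(TRUE, 4), rep(FALSE, 7), rep(TRUE, 3))
)

# Cross-tabulate the two categorical variables; naming the arguments
# labels the table's dimensions.
tab <- table(rear_wheel = cars_demo$rear_wheel, pickup = cars_demo$pickup)
tab

# Row totals show how many cases fall in each level of rear_wheel.
rowSums(tab)
```

In this toy data only 7 of the 40 rows are rear-wheel drive, which is the kind of imbalance a grid of density plots would hide.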

5. Higher dimensional plots

This is just the tip of the iceberg of high-dimensional data graphics. Anything you can discern visually, such as shape, size, color, pattern, and movement, in addition to relative location, can be mapped to a variable and plotted alongside other variables.
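As one possible illustration of mapping a visual property to a variable, the sketch below maps fill color to rear_wheel while faceting rows by pickup. The data frame cars_demo is made up and the column names are assumed; treat it as an example of the idea rather than code from the course.

```r
library(ggplot2)

# Made-up stand-in for the course's cars data (values are illustrative only).
set.seed(1)
cars_demo <- data.frame(
  msrp = round(runif(40, min = 15000, max = 60000)),
  pickup = rep(c(FALSE, TRUE), times = c(30, 10)),
  rear_wheel = c(rep(FALSE, 26), rep(TRUE, 4), rep(FALSE, 7), rep(TRUE, 3))
)

# Fill color carries rear_wheel, facet rows carry pickup, and the x-axis
# carries msrp: three variables in one graphic.
p_fill <- ggplot(cars_demo, aes(x = msrp, fill = rear_wheel)) +
  geom_density(alpha = 0.4) +
  facet_grid(pickup ~ ., labeller = label_both)

p_fill
```

Overlaying the two rear_wheel densities in a single panel makes their shapes easier to compare directly than placing them side by side in separate columns.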

6. Let's practice!

Alright, now it's your turn to practice.