Visualizing subsets

1. Visualizing subsets

We have seen that visualizing summaries can help us discover and describe overall trends and relationships in the data. In this section, we will explore some examples of how we can complement summary visualizations with detailed visualizations of smaller subsets of our data.

2. Visualizing subsets in detail

While summary visualizations can be very revealing, sometimes important insights are hidden by the summarization, and we need to look at the data in more detail to discover them. For example, here we have a summary visualization of the annual return for four stocks. From the summary, it appears all four stocks had a similar year, with a return of about 13%. However, looking at detailed plots of the daily prices, we see a very different year for each stock. We'll have more fun with this stock data in chapter 3. Visualizing large data in detail is challenging because there's too much data to look at! A useful technique in this case is to take a manageable subset of the data that has some natural meaning (such as all data for one stock), and visualize and explore it.
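To make the stock example concrete, here is a minimal sketch in base R. The prices are synthetic and the four symbols are hypothetical; this is not the course's stock data. Every path is built to start at 100 and end at 113, so the summary is identical while the daily detail differs.

```r
# Synthetic daily prices for four hypothetical stocks (not the course data).
n_days <- 252
make_path <- function(shape) {
  t <- seq(0, 1, length.out = n_days)
  # Each path starts at 100 and ends at 113; only the shape in between differs.
  100 + 13 * t + shape(t)
}
shapes <- list(
  A = function(t) 0 * t,
  B = function(t) 20 * sin(pi * t),
  C = function(t) -20 * sin(pi * t),
  D = function(t) 10 * sin(4 * pi * t)
)
stocks <- do.call(rbind, lapply(names(shapes), function(s) {
  data.frame(symbol = s, day = seq_len(n_days), price = make_path(shapes[[s]]))
}))

# Summary view: one number per stock (last price relative to first price).
annual_return <- sapply(split(stocks, stocks$symbol), function(d) {
  d$price[nrow(d)] / d$price[1] - 1
})

# Detailed view: a natural subset, here all days for one stock.
stock_b <- subset(stocks, symbol == "B")
range(stock_b$price)
```

The summary reports roughly the same return for all four symbols, while the price range of a single stock already hints at how different its year was.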

3. Investigating the tip amount distribution

As we saw in one of our previous exercises, the recorded tip amount is zero for every payment type except credit card. This is an interesting phenomenon that we want to get to the bottom of. With cash payments, does the taxi payment system fail to distinguish between tip and fare, so that the tip is folded into the total? Or does the total fare amount simply not include the amount that was tipped?

4. A subset of the taxi data

To investigate this question, we turn to detailed visualization of a subset of our data. We expect rides of the same nature to have similar fare and tip amounts. Therefore, if we can pull out a subset of our data for similar routes, we can compare the distributions of fare and tip amount to investigate our question. We expect the distributions of total fare for rides paid with cash and card to look similar if both cases include tips. Here, we have extracted a subset of the data for the most popular route, from the Upper East Side South to the Upper East Side North of Manhattan. Looking only at these trips and only at cash and card transactions, we have about 5,000 observations.
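As a rough sketch of how a subset like this could be pulled out, we can filter on the route and on the payment type. The data frame below is a tiny made-up stand-in, and the column names (pickup_zone, dropoff_zone, payment_type, total_amount) and payment labels are assumptions, not the actual names in the taxi data.

```r
# Tiny made-up stand-in for the taxi data; column names and values are hypothetical.
taxi <- data.frame(
  pickup_zone  = c("Upper East Side South", "Upper East Side South",
                   "Midtown Center", "Upper East Side South",
                   "Upper East Side South"),
  dropoff_zone = c("Upper East Side North", "Upper East Side North",
                   "Upper East Side North", "Upper East Side North",
                   "Midtown Center"),
  payment_type = c("Card", "Cash", "Card", "Dispute", "Cash"),
  total_amount = c(9.8, 7.5, 14.3, 8.0, 12.0),
  stringsAsFactors = FALSE
)

# Keep a single route and only the two payment types being compared.
route <- subset(
  taxi,
  pickup_zone == "Upper East Side South" &
    dropoff_zone == "Upper East Side North" &
    payment_type %in% c("Card", "Cash")
)

# Count the remaining rides per payment type.
table(route$payment_type)
```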

5. Total fare vs. trip duration

Let's do a check to ensure that this subset is well-behaved. Looking at the relationship between total fare and trip duration, we expect it to be cleaner since we are focusing on one simple route. Even with data this small, we are still overplotting many points, and we can alleviate this to a degree by using the alpha parameter to add transparency to the points. This looks much cleaner than what we saw for all routes.
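A scatterplot with transparency might look like the sketch below. The data is simulated, and the column names trip_duration and total_amount are hypothetical stand-ins for the single-route subset; the alpha value of 0.2 is just one reasonable choice.

```r
library(ggplot2)

# Simulated stand-in for the single-route subset; names and values are hypothetical.
set.seed(42)
route <- data.frame(trip_duration = runif(5000, min = 2, max = 15))
route$total_amount <- 3 + 0.6 * route$trip_duration + rnorm(5000, sd = 0.8)

# alpha = 0.2 makes each point mostly transparent, so dense regions appear darker.
p <- ggplot(route, aes(x = trip_duration, y = total_amount)) +
  geom_point(alpha = 0.2)
```

Printing p draws the plot. Lower alpha values help more with heavy overplotting, at the cost of making isolated points harder to see.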

6. Cash / card distribution comparison using a quantile plot

To compare the distribution of payments using card vs. cash, we can use a quantile plot. This displays the ordered values of the data against the quantiles of a uniform distribution, and is often more useful than a histogram for comparing distributions. We create a quantile plot using ggplot2's geom_qq(), specifying that the data should be plotted against the uniform distribution. In this plot, we see that the card and cash distributions have a similar shape but are shifted relative to each other. We also see that the cash payments are made up of several discrete values while card payments are more continuous (which we wouldn't be able to see in a histogram). From this, we can reasonably conclude that tips are not included in the total reported fare amount for cash payments. In the exercise, we will see whether the two distributions are similar once we remove tips from both.
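Here is a minimal sketch of such a quantile plot. The data is simulated under stated assumptions (cash totals rounded to the nearest 50 cents, card totals continuous and inflated by a 20% tip), and the column names are hypothetical. Passing stats::qunif as the distribution asks geom_qq() for uniform quantiles in place of its default normal quantiles.

```r
library(ggplot2)

# Simulated stand-in data; the values and the 20% tip are assumptions for illustration.
set.seed(7)
fares <- data.frame(
  payment_type = rep(c("Card", "Cash"), each = 500),
  total_amount = c(
    runif(500, min = 6, max = 12) * 1.2,          # card: continuous, tip included
    round(runif(500, min = 6, max = 12) * 2) / 2  # cash: discrete 50-cent steps
  )
)

# One quantile curve per payment type: sorted totals against uniform quantiles.
p <- ggplot(fares, aes(sample = total_amount, color = payment_type)) +
  geom_qq(distribution = stats::qunif)
```

Printing p draws the plot. In this simulated data, the cash curve shows up as flat steps and the card curve sits above it, which mirrors the kind of pattern described above.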

7. Let's practice!

Let's go!