Visualizing summaries

1. Introduction

Hi, I'm Ryan Hafen, and I'll be your instructor. As a data scientist, I love exploring datasets and finding new insights, particularly with large and complex datasets. In this chapter, you will learn methods for visualizing summaries of large datasets, looking at interactions between variables, and visualizing subsets in detail.

2. Overview

The most natural place to begin exploring a large dataset is to find general high-level patterns by plotting distributions and summary statistics of each variable. Summarization reduces the data to a manageable size and can help you get a general understanding of the data before asking more detailed questions.

3. Summaries of one variable

For summaries of one variable, we will focus on three major types of variables - continuous, categorical, and temporal. We'll cover one popular scalable summary method for each variable type. Each of these methods involves greatly reducing the size of the data in a computationally scalable way. That makes it possible to use these methods on very large datasets!

4. Gapminder data

To illustrate the methods, we will use the gapminder dataset, which you may have seen in other DataCamp courses. This dataset provides indicators for 142 countries at 12 points in time, one every five years from 1952 to 2007.

5. Summaries of one variable: continuous

Let's start with the histogram. Histograms provide an easily interpretable way to visualize the distribution of a single continuous variable by splitting the range of the variable into bins and counting the observations that fall into each bin. You can create a histogram using ggplot2's geom_histogram() function. Here is a histogram of the Gapminder life expectancy variable. We see a distribution that looks like it has at least two modes. While the underlying dataset is small for this example, we could make a similar plot for a much larger dataset since the computation is simply counting how many observations fall into each interval.
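As a rough sketch of the histogram described above, the code below assumes the data come from the gapminder CRAN package, where life expectancy is stored in the lifeExp column. The bin count of 30 and the object name p_hist are illustrative choices, not taken from the course.

```r
library(ggplot2)
library(gapminder)  # assumption: the CRAN gapminder package supplies the data

# Split the range of life expectancy into bins and count the observations
# in each bin. bins = 30 is an arbitrary illustrative choice.
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30)

p_hist
```

Because the plot only needs the per-bin counts, the same call works on much larger data as long as the counting step is feasible.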

6. Summaries of one variable: discrete

To visualize a single discrete variable, we can create a bar chart, which counts the number of records for each unique value of the variable. A bar chart can be created using ggplot2's geom_bar() function. Here, we count how many observations we have from each continent in the data.
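A minimal sketch of the bar chart of observations per continent, again assuming the gapminder CRAN package and its continent column. The object name p_bar is illustrative.

```r
library(ggplot2)
library(gapminder)  # assumption: the CRAN gapminder package supplies the data

# geom_bar() counts the rows for each unique value of the x variable,
# so no explicit summary step is needed here.
p_bar <- ggplot(gapminder, aes(x = continent)) +
  geom_bar()

p_bar
```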

7. Summaries of one variable: temporal

A useful way to visualize a temporal variable is to bin by time and compute the number of observations (or some other summary) in each bin using dplyr's group_by() and summarise() functions, and then plot the summary. Here, for example, we compute the median gross domestic product across all countries for each year and then plot it with ggplot2's geom_line(). At this point, you may have noticed a theme of binning. A general scalable strategy for many summary visualizations of large data is to use dplyr's group_by() and summarise() functions in creative ways. dplyr is fast on in-memory data, and the same code can scale further when it is backed by a database.
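The group-then-summarise pattern described above might look like the following sketch. It assumes the gapminder CRAN package, whose GDP variable is gdpPercap (GDP per capita), so the median here is of per-capita GDP. The names by_year, median_gdp, and p_line are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: the CRAN gapminder package supplies the data

# Bin by year and reduce each bin to a single summary value.
# gdpPercap is GDP per capita; using it here is an assumption.
by_year <- gapminder %>%
  group_by(year) %>%
  summarise(median_gdp = median(gdpPercap))

# Plot the small summary table rather than the raw records.
p_line <- ggplot(by_year, aes(x = year, y = median_gdp)) +
  geom_line()

p_line
```

The summary table has one row per year, so the plotting step stays cheap no matter how many raw records went into each bin.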

8. 1 Million NYC taxi rides

Speaking of larger datasets, let's now look at another dataset consisting of records of taxi ridership in New York City. This dataset contains a random sample of 1 million yellow cab taxi rides from July to December of 2016.

9. Taxi data

For each taxi ride, data such as the pick-up date, trip duration, and information about the cab fare are available. We have chosen a subset of variables, as well as a random sample of records, for the sake of keeping the data size manageable so that you can receive immediate feedback in the interactive exercises that follow. The same code used in this chapter can be applied to the full dataset on your own computer.

10. Let's practice!

Let's put what we've learned to work by visualizing summaries of these 1 million data points!