1. The summarize verb
In this chapter, you'll return to the topic of data transformation with dplyr to learn more ways to explore your data.
2. Data transformation and visualization
Analyses will usually involve a cycle between these steps of data transformation and visualization, as well as additional components of the data science workflow, like modeling, that you'll learn about in other DataCamp courses.
Once you've learned these new verbs, you'll be able to create a much larger variety of informative visualizations with ggplot2.
3. Extracting data
You've learned to use the filter verb to pull out individual observations, such as statistics for the United States in 2007. Now you'll learn how to summarize many observations into a single data point.
4. The summarize verb
For example, suppose you want to know the average life expectancy across all countries and years in the dataset.
You would do this with the summarize verb. Take your gapminder data, pipe it into summarize, and specify that you're creating a summary column called meanLifeExp.
The mean(lifeExp) there is worth examining. This calls the function mean on the variable lifeExp. The mean function computes the average of a set of values, and R comes with many built-in functions like this.
Notice that summarize collapses the entire table down into one row. In the output, we see the answer to our question: the mean life expectancy was about 59.47 years.
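The call described above can be sketched as follows. This is a minimal example that assumes the dplyr and gapminder packages are installed and that the data frame is the gapminder object from the gapminder package; the name overall is just an illustrative choice.

```r
# Assumes the dplyr and gapminder packages are installed.
library(dplyr)
library(gapminder)

# Collapse the whole table into a single row holding one summary column.
overall <- gapminder %>%
  summarize(meanLifeExp = mean(lifeExp))

print(overall)
```

The result is a one-row table with a single column, meanLifeExp.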
If you think about it, it doesn't really make sense to summarize across all countries and across all years. It may make more sense to ask questions about averages in a particular year, such as 2007.
5. Summarizing one year
To do this, you can combine the summarize verb with filter: filter your data for a particular year first, then summarize the result. This shows you that the average life expectancy in the year 2007 was about 67 years. You can create multiple summaries at once with the summarize verb.
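Here is a sketch of the filter-then-summarize pattern for 2007, again assuming the dplyr and gapminder packages are installed. The name avg_2007 is hypothetical and used only for illustration.

```r
# Assumes the dplyr and gapminder packages are installed.
library(dplyr)
library(gapminder)

# Keep only the 2007 observations, then average life expectancy over them.
avg_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp))

print(avg_2007)
```

Because filter runs first, summarize only ever sees the rows for 2007.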
6. Summarizing into multiple columns
For example, suppose that along with finding the average life expectancy in 2007, you want to find the total population in that year.
To do that, you add a comma after the mean of the life expectancy, and specify another column that you're creating. You could give it a useful name like totalPop, and set it equal to the sum (another built-in function) of the pop variable.
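A sketch of summarizing into two columns at once is shown below, assuming the dplyr and gapminder packages are installed. One deviation from the description above is an assumption on my part: the code wraps pop in as.numeric() as a precaution, because if pop is stored as an integer column, the world total is too large for R's integer type and sum(pop) could return NA. If pop is already numeric in your copy of the data, plain sum(pop) behaves the same. The name totals_2007 is hypothetical.

```r
# Assumes the dplyr and gapminder packages are installed.
library(dplyr)
library(gapminder)

# Two summaries in one call, separated by a comma.
# as.numeric() is a precaution against integer overflow when summing pop.
totals_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)))

print(totals_2007)
```

The result is still a single row, now with one column per summary.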
7. Functions you can use for summarizing
mean and sum are just two of the built-in functions you could use to summarize a variable within a dataset. Another example is median: the median represents the point in a set of numbers where half the numbers are above that point and half of the numbers are below. Two others are min, for minimum, and max, for maximum. In the exercises, you'll use several of these functions to answer questions about the gapminder dataset.
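The other summary functions can be used in exactly the same position as mean and sum. The sketch below assumes the dplyr and gapminder packages are installed; the column names medianLifeExp, minLifeExp, and maxLifeExp and the name spread_2007 are illustrative choices, not names from the course.

```r
# Assumes the dplyr and gapminder packages are installed.
library(dplyr)
library(gapminder)

# Median, minimum, and maximum life expectancy across countries in 2007.
spread_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(medianLifeExp = median(lifeExp),
            minLifeExp = min(lifeExp),
            maxLifeExp = max(lifeExp))

print(spread_2007)
```

Looking at the minimum, median, and maximum together gives a quick sense of how widely life expectancy varies in a single year.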
8. Let's practice!