1. Exploring data with dplyr
Hi! Welcome to the course! I'm James, and together, we'll discover dplyr's power for exploring, transforming, and aggregating data.
2. The dplyr package
The dplyr package is one of several packages included in the Tidyverse collection.
dplyr provides a ton of functionality to quickly manipulate and transform datasets.
dplyr can be installed on its own, or with the other Tidyverse packages by installing tidyverse.
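As a quick sketch of those two installation routes, the setup might look like the following. The install lines are commented out because installation only needs to happen once per machine; loading the package with library() is needed in every new session.

```r
# One-time setup: uncomment whichever line applies.
# install.packages("dplyr")      # dplyr on its own
# install.packages("tidyverse")  # dplyr plus the other Tidyverse packages

library(dplyr)  # load dplyr for the current session
```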
3. Chapter 1 verbs
In this chapter, you'll learn to use four dplyr verbs to explore and transform a dataset.
The four verbs are select(), filter(), arrange(), and mutate(). By the end of this chapter, you'll be comfortable using these verbs in various combinations.
4. 2015 United States Census
Throughout this course, we'll work with a real dataset, where you'll not only be able to practice the dplyr transformation verbs but also learn how to explore and draw insights from data. This particular dataset is from the 2015 United States Census.
5. United States counties
A state is one of the 50 regions that make up the United States, such as New York, California, or Texas. A county is a subregion of one of those states, like Los Angeles County in California.
6. counties dataset
The county-level US census data has been loaded into the counties tibble, which we can view by typing counties into the console.
This dataset contains loads of information, but don't worry, we're only going to work with a few variables, or columns, at a time! The table describes the people living in each county: the population, the unemployment rate, their income, and other demographic information. That means there are a lot of interesting questions we can answer with this data.
There are 40 variables in the counties dataset, and only the first few are previewed in the tibble.
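To try this outside the course environment, here is a minimal sketch that builds a tiny stand-in for counties. The column names and every value are assumptions made up for illustration; they are not census figures, and the real table has many more columns and rows.

```r
library(dplyr)

# Hypothetical stand-in for the course's counties tibble.
# Column names and values are invented for illustration, not census figures.
counties <- tibble(
  state        = c("California", "Texas", "New York"),
  county       = c("Los Angeles", "Harris", "Kings"),
  population   = c(1000, 2000, 3000),
  unemployment = c(5, 6, 7),
  income       = c(50000, 60000, 70000)
)

# Typing the name prints a preview of the tibble.
counties
```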
7. glimpse()
The glimpse() function can be used to view the first few values from each variable, along with the data type, which is a useful first step in understanding the data.
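A short sketch of glimpse() in use, again on a made-up stand-in for counties (the column names and values are assumptions, not census figures). Each variable gets one row in the output, showing its type and first few values.

```r
library(dplyr)

# Hypothetical stand-in for the course's counties tibble; values are invented.
counties <- tibble(
  state        = c("California", "Texas", "New York"),
  county       = c("Los Angeles", "Harris", "Kings"),
  population   = c(1000, 2000, 3000),
  unemployment = c(5, 6, 7)
)

# One line per variable: name, data type, then the first few values.
glimpse(counties)
```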
8. select() verb
Datasets often come with more variables than we need for any one analysis. Let's keep only a few variables: the state, the county, the total population, and the unemployment rate.
We can do this using the select() verb. select() extracts only particular variables from a dataset. Using dplyr syntax, we type counties, then the pipe operator (%>%), which is a greater-than sign between two percent signs. Then, we select the variables of interest by calling the select() function and passing the variable names separated by commas.
The output contains only the variables we passed to select().
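The select() step described above can be sketched like this. The counties tibble here is a made-up stand-in, and the column names state, county, population, and unemployment are assumed from the description rather than confirmed by the source.

```r
library(dplyr)

# Hypothetical stand-in for the course's counties tibble; values are invented.
counties <- tibble(
  state        = c("California", "Texas", "New York"),
  county       = c("Los Angeles", "Harris", "Kings"),
  population   = c(1000, 2000, 3000),
  unemployment = c(5, 6, 7),
  income       = c(50000, 60000, 70000)
)

# Pipe counties into select() and keep only the four variables of interest.
counties %>%
  select(state, county, population, unemployment)
```

The income column is dropped from the result because it was not passed to select().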
9. Creating a new table
Sometimes, we want to keep the data we've selected for use further down the line. We can use assignment to store this new table. Recall that we assign a value to a name using the arrow operator (<-), written as a less-than sign followed by a dash.
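Storing the selected table might look like the sketch below. As before, counties is a made-up stand-in with assumed column names and invented values. Note that the assignment itself prints nothing, and the original counties table is left unchanged.

```r
library(dplyr)

# Hypothetical stand-in for the course's counties tibble; values are invented.
counties <- tibble(
  state        = c("California", "Texas", "New York"),
  county       = c("Los Angeles", "Harris", "Kings"),
  population   = c(1000, 2000, 3000),
  unemployment = c(5, 6, 7),
  income       = c(50000, 60000, 70000)
)

# Store the result of select() under a new name using the arrow operator.
counties_selected <- counties %>%
  select(state, county, population, unemployment)
```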
10. Printing the dataset
This new table, counties_selected, can now be accessed or used. We can print it by typing its name, just as we did with the first one.
11. Let's practice!
Let's practice exploring the counties dataset!