
Census data wrangling with tidy tools

1. Census data wrangling with tidy tools

tidycensus is designed to return Census data ready for use within tidyverse data science workflows. In this section, we'll discuss how to use tidycensus functions to accomplish practical data analysis tasks with US Census data.

2. The tidyverse

The tidyverse refers to a series of R packages that are designed to work together in a data science workflow. These packages include dplyr for data wrangling; purrr for functional programming; ggplot2 for data visualization; and tidyr for data reshaping, among others. tidycensus uses several of these packages under the hood to process data from the Census API for you and is designed to return data in a format that can be processed by tidyverse functions.

3. Group-wise census data analysis

Key to tidyverse workflows is the "split-apply-combine" model of data analysis. This involves identifying natural groups in a dataset, which might refer to unique values in a given column; splitting the dataset by these groups; applying a function to each of the groups; then combining the result back into a dataset to allow for comparisons. In the example on this slide, we are performing such an analysis using a chain of functions, linked by the pipe operator. Here, we are using tidyverse functions to identify the largest racial or ethnic group in each county. We first group by GEOID, which uniquely identifies each county, then we filter the dataset for those rows in which the estimate is equal to the largest estimate for each group, giving us back the group maximum.
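As a rough sketch of the group maximum pattern described here, the snippet below runs the same group_by() and filter() chain on a small made-up data frame. The object names race_toy and largest_group, the county codes, and the estimates are all hypothetical and are not real Census figures; with real data, the input would come from a tidycensus call instead.

```r
library(dplyr)

# Made-up estimates for three hypothetical counties (not real Census data)
race_toy <- data.frame(
  GEOID    = c("001", "001", "003", "003", "005", "005"),
  variable = c("white", "hispanic", "white", "hispanic", "white", "hispanic"),
  estimate = c(500, 300, 200, 700, 450, 100),
  stringsAsFactors = FALSE
)

largest_group <- race_toy %>%
  group_by(GEOID) %>%                    # split: one group per county
  filter(estimate == max(estimate)) %>%  # apply: keep the row(s) at the group maximum
  ungroup()                              # combine: back to a single ungrouped table

largest_group
```

Note that filter() keeps every row that ties for the maximum, so a county with two equally large groups would return two rows.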

4. Group-wise census data analysis

We can summarize the results with the tally() function after grouping by variable. In 67 counties in Texas, Hispanic is the largest racial or ethnic group, and non-Hispanic white is the largest in the other 187.
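A minimal sketch of that counting step, assuming we already have one row per county holding its largest group. The largest_toy data frame and its values are made up for illustration and do not reproduce the Texas results.

```r
library(dplyr)

# Hypothetical result of the group-maximum step: one row per county
largest_toy <- data.frame(
  GEOID    = c("001", "003", "005", "007"),
  variable = c("white", "hispanic", "white", "white"),
  estimate = c(500, 700, 450, 320),
  stringsAsFactors = FALSE
)

group_counts <- largest_toy %>%
  group_by(variable) %>%  # one group per racial or ethnic category
  tally()                 # count rows (counties) in each group, stored in column n

group_counts
```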

5. Recoding variables for group-wise analysis

Using tidyverse tools, analysts can also create their own groups for group-wise data analysis. The example on this slide uses the household income dataset for counties in Washington acquired earlier in this chapter. By default, table B19001 returns relatively fine income bands. However, an analyst might want to combine some of these categories and recode into broader groups. This is accomplished with dplyr's case_when() function. Once these new groups are defined, the new group variable, incgroup, can be passed to the group_by() function, allowing the analyst to compute group sums with summarize().
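The recoding step might look something like the sketch below. The toy data frame stands in for the Washington income dataset, with made-up counts for a single hypothetical county. The cutoffs assume that variables B19001_002 through B19001_007 cover incomes below $35,000 and B19001_008 through B19001_012 cover $35,000 to $74,999; the group labels and the object names income_toy and income_grouped are illustrative choices.

```r
library(dplyr)

# Made-up household counts for one hypothetical county (not real Census data)
income_toy <- data.frame(
  NAME     = "County A",
  variable = c("B19001_001", "B19001_003", "B19001_005", "B19001_009", "B19001_014"),
  estimate = c(100, 20, 10, 30, 40),
  stringsAsFactors = FALSE
)

income_grouped <- income_toy %>%
  filter(variable != "B19001_001") %>%       # drop the table total
  mutate(incgroup = case_when(
    variable < "B19001_008" ~ "below35k",    # assumed: bands under $35,000
    variable < "B19001_013" ~ "35kto75k",    # assumed: $35,000 to $74,999
    TRUE                    ~ "above75k"     # everything else
  )) %>%
  group_by(NAME, incgroup) %>%
  summarize(group_est = sum(estimate), .groups = "drop")

income_grouped
```

case_when() evaluates its conditions in order, so each variable is assigned to the first band whose test it passes. That is why the second condition does not need an explicit lower bound.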

6. Iterating through years with purrr

purrr, a core package in the tidyverse, includes a wide variety of tools for functional programming. purrr is especially useful for iteration, which refers to the repetition of a process for a series of values. For ACS data, purrr can be used to acquire data for multiple years, and combine those datasets into a single output dataset. In the example shown on the slide, we use the map_df() function to iterate through a vector of years - 2012 through 2016 - and call the get_acs() function for each year. We can then use the result to examine demographic changes over time, using cities in Michigan as an example. We see here that Ann Arbor, Michigan has been growing in population, while Dearborn, Michigan has been declining.
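The iteration pattern can be sketched as follows. So that it runs without a Census API key, the snippet uses fetch_year(), a hypothetical stand-in that returns made-up populations for two invented cities. In a real workflow, the body of that function would be a get_acs() call with its year argument set to the value being iterated over.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for a get_acs() request; returns made-up populations
fetch_year <- function(yr) {
  data.frame(
    NAME     = c("City A", "City B"),
    estimate = c(1000 + (yr - 2012) * 10, 900 - (yr - 2012) * 5),
    stringsAsFactors = FALSE
  ) %>%
    mutate(year = yr)  # tag each row with the year it came from
}

years <- 2012:2016

# Call fetch_year() once per year and row-bind the results into one data frame
city_pop <- map_df(years, fetch_year)

city_pop
```

Adding the year column inside the function matters: without it, the stacked rows would be indistinguishable across years once they are combined.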

7. Let's practice!

Now, let's try out these examples in R!