1. Be fruitful and dplyr
Hello. I'm Chester, and welcome to this course.
2. Course prerequisites
This course builds on the skills from these other DataCamp courses working with dplyr and writing functions. Make sure you have completed each of these before progressing.
3. Course outline
Throughout the course, we are going to learn tips and strategies to improve our dplyr toolbox and to multiply our programming skills with the tidyverse.
In Chapter 1, we'll briefly revisit dplyr functions before focusing on selecting columns based on patterns.
In Chapter 2, we'll relocate columns in our data and perform transformations across many columns.
In Chapter 3, we'll shift from working with a single data source to ways to work with multiple data sources programmatically using joins and set theory clauses.
In Chapter 4, we'll close the course by creating functions to customize and repeat dplyr and ggplot2 code using the powerful rlang package.
4. The world_bank_data tibble
The world_bank_data tibble is one of the datasets we'll work with. The World Bank supplies data on country-level trends over time. A subset of this data is shown here. For this course, we'll examine country properties such as infant mortality rate.
5. world_bank_data columns
The names function gives the column names of a tibble or data frame. Here we can see the names of the 12 columns of world_bank_data.
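As a rough illustration of the names function, here is a small sketch. The toy_data tibble is a hypothetical stand-in for world_bank_data with only four columns, and its values are invented for the example.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; all values are invented
toy_data <- tibble(
  country = c("Country A", "Country B", "Country C"),
  continent = c("Africa", "Asia", "Europe"),
  year = c(2010, 2010, 2010),
  perc_rural_pop = c(70, 30, 25)
)

# names() returns the column names as a character vector
column_names <- names(toy_data)
print(column_names)
```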
6. Select some columns from world_bank_data
Recall the select function expects the columns we'd like to choose as unquoted arguments separated by commas.
We can see the result of choosing a few columns here.
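A minimal sketch of that select call follows. It assumes a made-up toy_data tibble in place of world_bank_data, so the values and the particular columns chosen are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; all values are invented
toy_data <- tibble(
  country = c("Country A", "Country B", "Country C"),
  continent = c("Africa", "Asia", "Europe"),
  year = c(2010, 2010, 2010),
  perc_rural_pop = c(70, 30, 25)
)

# Columns are passed unquoted and separated by commas
selected <- toy_data %>%
  select(country, continent, perc_rural_pop)

print(selected)
```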
7. Filter rows to match continent values
If we'd like to return only rows that match certain criteria, remember we use the filter function.
Suppose we are interested in only a portion of this data: the entries for countries in Africa or Asia. We can include both of these continents in a new vector called continents_vector and then filter based on this.
To filter, we'll select the columns as before and pipe them into the filter function. We want to return rows where continent is either Africa or Asia. This can be done with the percent-in-percent operator, with continent on the left and continents_vector on the right. We assign this result to asia_africa_results.
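The filtering step described here might look like the sketch below. The continents_vector and asia_africa_results names come from the narration, while toy_data and its values are invented stand-ins for world_bank_data.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; all values are invented
toy_data <- tibble(
  country = c("Country A", "Country B", "Country C", "Country D"),
  continent = c("Africa", "Asia", "Europe", "Asia"),
  year = c(2010, 2010, 2010, 2010),
  perc_rural_pop = c(70, 30, 25, 0)
)

# The continents we want to keep
continents_vector <- c("Africa", "Asia")

# Select columns, then keep rows whose continent appears in continents_vector
asia_africa_results <- toy_data %>%
  select(country, continent, perc_rural_pop) %>%
  filter(continent %in% continents_vector)

print(asia_africa_results)
```

Using %in% with a vector keeps the condition short and makes it easy to add or remove continents later without rewriting the filter.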
8. Results of row filter
Looking at the results, we see that we have returned only rows for countries in Africa or Asia. Note also that the number of rows has dropped from 300 in world_bank_data to 111.
9. Mutate a new column
We can add another column to the asia_africa_results tibble that will be useful for further analysis.
Recall that the mutate function can create a new column based either on combining multiple columns or on doing a calculation with a constant value. Each country's population is split between rural and urban areas. The data contains the perc_rural_pop column for rural population but does not have one representing urban population. We can create perc_urban_pop by subtracting perc_rural_pop from 100.
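As a sketch of this mutate step, the example below builds a small invented version of asia_africa_results and derives perc_urban_pop from it. The values are made up, and storing the result in a separate with_urban object is an assumption made for the example.

```r
library(dplyr)

# Hypothetical version of asia_africa_results; all values are invented
asia_africa_results <- tibble(
  country = c("Country A", "Country B", "Country D"),
  continent = c("Africa", "Asia", "Asia"),
  perc_rural_pop = c(70, 30, 0)
)

# Urban share is whatever is left after removing the rural share from 100
with_urban <- asia_africa_results %>%
  mutate(perc_urban_pop = 100 - perc_rural_pop)

print(with_urban)
```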
10. Results of mutate
We can now see results about the urban versus rural split of countries in different years. Interestingly, Singapore has a perc_urban_pop value of 100, meaning that it has an entirely urban population.
11. Analyze urban percentage across regions
Finally, we can use the group_by and summarize functions to see how this urban population rate differs across regions in Africa and Asia.
We first group_by the region column.
Then we use the mean function inside of summarize to calculate the mean urban population percentage.
The resulting tibble is at the region level. South-Eastern Asia is predominantly urban, and Eastern Africa is much more rural on average.
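The grouping and summarizing steps above can be sketched as follows. The region labels and values are invented, and the summary column name mean_perc_urban_pop is an assumption, since the narration does not name it.

```r
library(dplyr)

# Hypothetical data with a region column; all values are invented
toy_results <- tibble(
  country = c("Country A", "Country B", "Country C"),
  region = c("Region X", "Region X", "Region Y"),
  perc_urban_pop = c(40, 60, 20)
)

# One row per region, holding the mean urban percentage for that region
region_summary <- toy_results %>%
  group_by(region) %>%
  summarize(mean_perc_urban_pop = mean(perc_urban_pop))

print(region_summary)
```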
12. Let's practice!
Time to refresh your dplyr knowledge on a dataset from the International Monetary Fund.