
That has crossed the line

1. That has crossed the line

We'll now extend our knowledge of mutate, summarize, and count to go across many columns. First, recall the mutate function syntax.

2. Creating a new column based on another

The mutate function is commonly used to produce a new column, computing along each row. With world_bank_data, suppose that, alongside the percentage value in perc_college_complete, we also wanted a proportion between 0 and 1. Dividing by 100 produces prop_perc_college_complete on this scale. The dot-keep argument controls which columns will be present in the resulting tibble. The default of dot-keep is "all", which includes the new columns as well as all previous ones. Setting it to "used" returns only the new column and the columns used in the calculation, which is helpful for double-checking our work directly.
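As a minimal sketch of this step, the code below uses a small made-up tibble, toy_data, as a stand-in for world_bank_data (the values and the country labels are assumptions for illustration only). The dot-keep argument is written `.keep` in code.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; values are invented
toy_data <- tibble(
  country = c("A", "B", "C"),
  perc_college_complete = c(25, 40, 10)
)

toy_prop <- toy_data %>%
  mutate(
    # Convert the percentage to a proportion between 0 and 1
    prop_perc_college_complete = perc_college_complete / 100,
    # Keep only the columns used in the calculation, plus the new column
    .keep = "used"
  )

toy_prop
```

With `.keep = "used"`, the country column is dropped because it plays no part in the calculation.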

3. Say hello to across()

The across function works inside of mutate to allow us to perform the same calculation across multiple columns. Let's try this on the perc_college_complete column first, passing the column name to dot-cols. We specify what function we'd like to apply to the columns with dot-fns, making sure to include the tilde and the dot-x if we'd like to create a custom function rather than use a built-in function like max or mean. Lastly, we specify how we'd like the new columns to be named with dot-names. Inside dot-names, the original column name is available as dot-col wrapped in curly brackets. Here, prop_ is prepended to this original column name. The output is the same as we saw using mutate without across.
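A sketch of the same calculation written with across, again on a made-up tibble (toy_data and its values are assumptions, not the course data). The spoken dot-cols, dot-fns, and dot-names correspond to `.cols`, `.fns`, and `.names`.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; values are invented
toy_data <- tibble(
  country = c("A", "B", "C"),
  perc_college_complete = c(25, 40, 10)
)

toy_across <- toy_data %>%
  mutate(
    across(
      .cols = perc_college_complete,  # column(s) to compute on
      .fns = ~ .x / 100,              # custom function: tilde plus .x
      .names = "prop_{.col}"          # prepend prop_ to the original name
    )
  )

toy_across
```

Because `.names` is supplied, the original column is left in place and the result is added as a new column.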

4. Computing across multiple columns

To extend this to go through multiple columns, update the dot-cols argument to be those columns to compute across. To divide all the columns that start with perc by 100, we can use the starts_with function.
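The sketch below shows the starts_with selection on a made-up tibble. Only perc_college_complete is named in the transcript; perc_rural_pop is a hypothetical second column added so that more than one column matches.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; values are invented
toy_data <- tibble(
  country = c("A", "B"),
  perc_college_complete = c(25, 40),
  perc_rural_pop = c(60, 30)  # hypothetical column name
)

toy_props <- toy_data %>%
  mutate(
    across(
      .cols = starts_with("perc"),  # every column whose name starts with "perc"
      .fns = ~ .x / 100,
      .names = "prop_{.col}"
    )
  )

names(toy_props)
```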

5. The new prop columns

All four columns starting with "perc" now have a proportion counterpart, starting with "prop".

6. Tweaking column names

The column names are a little much here with the prepended prop. We can use the base R sub function to replace column names matching the pattern prop_perc with prop. sub takes the pattern to replace as the pattern argument, the replacement text as replacement, and the vector to do the substitution on as x.
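The transcript does not show exactly where sub is called, so the sketch below is one possible way to do it: it applies sub to the column names afterwards with dplyr's rename_with. The tibble and the perc_rural_pop column are made up for illustration.

```r
library(dplyr)

# sub() on a single name: pattern, replacement, then the vector x
sub(pattern = "prop_perc", replacement = "prop", x = "prop_perc_college_complete")

# Hypothetical stand-in for world_bank_data; values are invented
toy_data <- tibble(
  country = c("A", "B"),
  perc_college_complete = c(25, 40),
  perc_rural_pop = c(60, 30)  # hypothetical column name
)

toy_renamed <- toy_data %>%
  mutate(across(starts_with("perc"), ~ .x / 100, .names = "prop_{.col}")) %>%
  # Names that do not contain "prop_perc" are returned unchanged by sub()
  rename_with(~ sub(pattern = "prop_perc", replacement = "prop", x = .x))

names(toy_renamed)
```

sub replaces only the first match in each name, which is enough here because the pattern occurs at most once per column name.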

7. across() with summarize()

Let's combine summarize and across to find median rates for all countries listed in 2015. The result has only one row and a column for each rate. First, we filter the data to 2015. Next, we call across inside summarize. dot-cols is the columns with names ending in rate, dot-fns is median, and we prepend median to the name of each column. These are the median rates for all countries in the data for 2015.
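A sketch of this pipeline on a made-up tibble. The column names birth_rate and death_rate, and all values, are assumptions chosen only so that two columns end in "rate".

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; names and values are invented
toy_rates <- tibble(
  country = c("A", "B", "C", "A"),
  year = c(2015, 2015, 2015, 2014),
  birth_rate = c(10, 20, 30, 99),
  death_rate = c(5, 7, 9, 99)
)

medians_2015 <- toy_rates %>%
  filter(year == 2015) %>%          # keep only the 2015 rows
  summarize(
    across(
      .cols = ends_with("rate"),    # every column whose name ends in "rate"
      .fns = median,                # built-in function, so no tilde needed
      .names = "median_{.col}"      # prepend median_ to each name
    )
  )

medians_2015
```

The 2014 row is filtered out before summarizing, so it does not affect the medians.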

8. count() how many rows are in each combination

The count function gives how many rows are in each combination of variables. Separating the column names with commas in count specifies which variable combinations we are interested in. Coming back to world_bank_data, let's look at combinations of country and continent. Australia and Oceania appear together in four rows, one for each of four years, whereas Angola and Africa have only one entry.
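A small sketch of count on two columns. The tibble below is invented to mirror the pattern described (four rows for one pairing, one row for the other); the years are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; years are invented
toy_geo <- tibble(
  country = c(rep("Australia", 4), "Angola"),
  continent = c(rep("Oceania", 4), "Africa"),
  year = c(2012, 2013, 2014, 2015, 2015)
)

# One output row per country/continent combination, with the row count in n
combos <- toy_geo %>%
  count(country, continent)

combos
```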

9. count() with across() and introducing where()

This listing of columns with commas is OK if we only have a few variables of interest, but it can be tedious if we want to go across, say, all non-numeric columns. Let's do that here using across inside of count. The dot-cols argument is which columns to choose. The where function from tidyselect chooses columns based upon conditions. Passing is-dot-numeric to where and placing an exclamation point at the front will count across all non-numeric columns.
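A sketch of counting across every non-numeric column, using a made-up tibble in which country and continent are character columns and the rest are numeric. The spoken is-dot-numeric is written `is.numeric` in code.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; values are invented
toy_mixed <- tibble(
  country = c("Portugal", "Portugal", "Portugal", "Angola"),
  continent = c("Europe", "Europe", "Europe", "Africa"),
  year = c(2013, 2014, 2015, 2015),
  perc_college_complete = c(15, 16, 17, 3)
)

# where(is.numeric) selects numeric columns; the leading ! negates the selection
non_numeric_counts <- toy_mixed %>%
  count(across(.cols = !where(is.numeric)))

non_numeric_counts
```

The numeric columns, year and perc_college_complete, do not take part in the grouping, so the result has only country, continent, and n.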

10. Sorted result

Setting count's sort argument to TRUE, we see that Portugal's combination of non-numeric values appears in the most rows.
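A brief sketch of the sorted version, on an invented tibble in which Portugal has the most rows. The data is an assumption and is not meant to reproduce the course output.

```r
library(dplyr)

# Hypothetical stand-in for world_bank_data; values are invented
toy_mixed <- tibble(
  country = c("Angola", "Portugal", "Portugal", "Portugal"),
  continent = c("Africa", "Europe", "Europe", "Europe"),
  year = c(2015, 2013, 2014, 2015)
)

# sort = TRUE orders the combinations from largest n to smallest
sorted_counts <- toy_mixed %>%
  count(across(.cols = !where(is.numeric)), sort = TRUE)

sorted_counts
```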

11. Let's practice!

Test out these talents on the IMF data!