
Split-Apply-Combine

1. Split-Apply-Combine

Simple tabulations don't always give us the answers we are looking for.

2. Split-Apply-Combine

To get the ones we want, we may need to group or split the data ourselves, perform our own calculations, and return the results. We'll refer to this generically as split-compute-combine, but it goes by other, similar names, including divide and recombine, split-apply-combine, and even software alchemy when it's used in statistical analyses. In this section, you'll learn how to create more sophisticated summaries using split(), Map(), and Reduce().

3. Partition using split()

In the first step, the split, rows of a data set are partitioned. These partitions could be random or they could be levels from a categorical variable in the data set we're working with. Let's use R's split() function to partition data. The first argument to the split() function is a vector or data frame containing the data that will be split. The second argument is a factor variable with one value per element of the first argument (or one value per row, if the first argument is a data frame). Each level of this factor variable corresponds to one of the partitions.
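As a minimal sketch of that call, here is split() applied to a plain vector. The values and the names amounts, year, and parts are invented for illustration and are not part of the mortgage data.

```r
# Hypothetical toy data, made up for this example.
amounts <- c(100, 250, 175, 300, 225)
year <- factor(c(2015, 2016, 2015, 2016, 2017))

# split() expects one grouping value per element of the data.
stopifnot(length(amounts) == length(year))

# Partition the amounts by the levels of the factor.
parts <- split(amounts, year)
print(parts)
```

Each level of year ends up with its own element in parts, holding the amounts that belong to that level.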

4. Partition using split()

The split() function returns a named list. Names correspond to levels of the factor variable used to partition the data. Each element of the list contains one partition of the data. In this case, we're splitting the rows of the mortgage data by year. The result is a list whose names are years and whose elements contain the rows for the mortgages in that year.
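To make the shape of that result concrete, the sketch below splits a small data frame by year. The data frame mort is a hypothetical stand-in for the mortgage data, and its column names and values are assumptions made for this example.

```r
# Hypothetical stand-in for the mortgage data; columns and values are invented.
mort <- data.frame(
  year = c(2015, 2015, 2016, 2016, 2016, 2017),
  borrower_income = c(52000, NA, 61000, 48000, NA, 75000),
  credit_score = c(700, 680, NA, 720, 710, 690)
)

# Split the rows by year; the year column is treated as a factor.
by_year <- split(mort, mort$year)

# The list names are the years.
print(names(by_year))

# Each element is a data frame holding the rows for one year.
print(by_year[["2016"]])
```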

5. Compute using Map()

Now that we have the data partitioned, we'll compute on each partition. In this second step, each data partition is sent to a function that processes the data. Our example uses the Map() function to do this. Here, we call Map() with two arguments. The first is a function that will be applied to each element of the list supplied in the second argument.

6. Compute using Map()

In this case, the function counts the number of missing values for each column of the mortgage data, by year. The result of the Map() function is another named list where the names once again correspond to the levels of the factor variable. Each element of the list contains whatever was calculated in the Map() function. In this case, each element is a numeric vector holding the number of missing values per column for that year.
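The compute step might look something like the sketch below. It again uses a hypothetical stand-in for the mortgage data, and the helper count_missing is a name made up for this example, not a function from the course.

```r
# Hypothetical stand-in for the mortgage data; columns and values are invented.
mort <- data.frame(
  year = c(2015, 2015, 2016, 2016, 2016, 2017),
  borrower_income = c(52000, NA, 61000, 48000, NA, 75000),
  credit_score = c(700, 680, NA, 720, 710, 690)
)
by_year <- split(mort, mort$year)

# Hypothetical helper: count the missing values in each column of one partition.
count_missing <- function(part) colSums(is.na(part))

# Apply the helper to every partition; the list names (years) are carried over.
na_by_year <- Map(count_missing, by_year)
print(na_by_year)
```

Because by_year is a named list, the list returned by Map() keeps the same names, so each vector of counts can still be looked up by year.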

7. Combine using Reduce()

Now that we have the results in a list, we may want to combine them so that they can be used in subsequent analyses. This is often done with the Reduce() function. Like Map(), Reduce() takes a function and then a list. The function describes how two results are combined into one, and Reduce() applies it successively until the whole list has been collapsed into a single result.
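To see how Reduce() works through a list, here is a small sketch with made-up numbers. The list partial_sums and its contents are invented for illustration. The accumulate argument keeps the intermediate results, so we can watch the combination happen.

```r
# Hypothetical list of per-partition results, made up for this example.
partial_sums <- list(a = c(1, 2), b = c(10, 20), c = c(100, 200))

# Combine the elements two at a time: (a + b) + c.
total <- Reduce(function(x, y) x + y, partial_sums)
print(total)

# accumulate = TRUE returns every intermediate result instead of just the last.
steps <- Reduce(function(x, y) x + y, partial_sums, accumulate = TRUE)
print(steps)
```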

8. Combine using Reduce()

Returning to our example, suppose we want to get the total count of missing values by column. We can do this by reducing the list returned by Map() using the plus operator, which adds up the vectors. We could also use Reduce() to create a new matrix with the count of missing values for each column by year. In this case, we Reduce() the list using rbind(), resulting in a matrix where each row corresponds to a year and each column corresponds to a mortgage variable.
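Both ways of combining can be sketched as follows. The list na_by_year below is typed in by hand as a hypothetical stand-in for the output of the Map() step, with invented column names and counts. Setting the row names from the list names is a choice made in this sketch so that the rows are labeled by year.

```r
# Hypothetical per-year counts of missing values, standing in for the Map() output.
na_by_year <- list(
  "2015" = c(year = 0, borrower_income = 1, credit_score = 0),
  "2016" = c(year = 0, borrower_income = 1, credit_score = 1),
  "2017" = c(year = 0, borrower_income = 0, credit_score = 0)
)

# Add the vectors element-wise to get the total missing count per column.
total_na <- Reduce(`+`, na_by_year)
print(total_na)

# Stack the vectors as rows to get a year-by-column matrix.
na_matrix <- Reduce(rbind, na_by_year)

# Label the rows with the years (an addition in this sketch).
rownames(na_matrix) <- names(na_by_year)
print(na_matrix)
```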

9. Let's practice!

Now that you've seen how to perform split-compute-combine operations with R, we'll practice them with the mortgage data.