The Bigmemory Suite of Packages

1. The Bigmemory Suite of Packages

You've made it to chapter 2 of the course.

2. So far...

Now that you know the basics of how to import, subset, and assign values for big.matrix objects, we are going to move on to exploratory data analysis using bigmemory. In this chapter, you'll learn how to create tables and summaries that let you see structure in the data.

3. Associated Packages

The bigmemory package is not stand-alone. It is part of a suite of packages that make use of bigmemory for processing big matrix objects. These packages include biganalytics for summarizing, bigtabulate for splitting and tabulating,

4. Associated Packages

and bigalgebra for linear algebra operations.

5. Associated Packages

Other contributed packages fit models with big.matrix objects. These include bigpca for principal components, bigFastLM for linear models, biglasso for penalized linear and logistic regression, and bigrf for random forests. For the rest of this section, we will focus on how to summarize and tabulate data using biganalytics and bigtabulate.
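To give a feel for the summarizing side, here is a minimal sketch using biganalytics. The small matrix and its column names ("year" and "msa") are made up for illustration and only stand in for the mortgage data; the sketch assumes both packages are installed.

```r
library(bigmemory)
library(biganalytics)

# Hypothetical toy data standing in for the mortgage sample
toy <- as.big.matrix(
  cbind(year = rep(2009:2012, times = 5),
        msa  = rep(0:1, times = 10))
)

# Column means computed on the big.matrix object
col_means <- colmean(toy)
col_means

# Column minimums and maximums
col_mins <- colmin(toy)
col_maxs <- colmax(toy)
```

The same calls should work unchanged on a file-backed big.matrix, which is the point of using the suite on data that is too large for a data frame.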

6. The FHFA's Mortgage Data Set

You may have noticed that the file we read in the last chapter was called "mortgage-sample.csv". It comes from a publicly available data set from the Federal Housing Finance Agency that chronicles all mortgages held or securitized by the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009 to 2015. The full data set includes tens of millions of mortgages, along with demographic and financial information about the individual lenders. The data set is available online and, by analyzing these data, we may be able to understand the disparity in homeownership among groups from different backgrounds, assess the risk of default, or even detect events like the 2008 housing market crash. The entire data set is 2.7 gigabytes in size. On your own machine, R may be able to read this in as a data frame and still stay below 10-20% of the available RAM, but the data set is above that threshold for the virtual machines that run the exercise code. Furthermore, some of the exercises would take longer to run than you'd like to wait. So, if you'd like to run the code on the entire data set after this course, please feel free to do so by downloading it from the link. For this class, however, we are going to work with a random subset of 70,000 loans. The code we write will work on both the subset and the full data set.

7. 1st example: using bigtabulate with bigmemory

The examples in this section will have you creating tables to summarize the mortgage data. The code starts by loading the bigtabulate package, which provides, among other functions, bigtable. The bigtable function is very similar to R's table function but was designed to be used with bigmemory. To use the function, specify the big.matrix object as the first argument and the columns you'd like to tabulate over as the second. If you'd like to create nested tables, pass a vector of column names as the second argument.
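As a rough sketch of the calls just described, the code below builds a small in-memory big.matrix and tabulates it. The data and the column names ("year" and "msa") are hypothetical placeholders, not the actual mortgage sample, and the sketch assumes bigmemory and bigtabulate are installed.

```r
library(bigmemory)
library(bigtabulate)

# Hypothetical toy data standing in for the mortgage sample
toy <- as.big.matrix(
  cbind(year = rep(2009:2012, times = 5),
        msa  = rep(c(0, 0, 1, 1, 1), times = 4))
)

# One-way table: big.matrix first, column to tabulate over second
year_table <- bigtable(toy, "year")
year_table

# Nested table: pass a vector of column names
msa_year_table <- bigtable(toy, c("msa", "year"))
msa_year_table
```

Each cell of the nested table counts the rows that share one combination of the listed columns, much as table would for two vectors.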

8. Let's practice!

Let's practice!