Document term matrices

1. Document term matrices

Now that we have some intuition about what a topic model is, and specifically what LDA is, let's get ready to run a topic model! To do that, we will need to get comfortable navigating in and out of our tidy text data frame, starting with creating a document term matrix.

2. Matrices and sparsity

So far we have focused on the benefits of a tidy text data frame, where each row is a single word, or token, from a document. To run a topic model, however, we first need to create a document term matrix, or DTM. To illustrate, sparse_review is a subset of the DTM we'll create. It includes the term counts for just four reviews, and only for terms that begin with A. A matrix is like a data frame, except that every value has to be of the same type. We index a matrix by referring to its rows and then its columns. A document term matrix has a single row for each document and a column for every unique word, or term, used across all documents in the corpus. Each value in the DTM is the number of times the given term is used in the given document. Looking at sparse_review, we can see that the most common value is zero. A matrix that is composed mostly of zeros is called a sparse matrix, and the property is referred to as sparsity. Our DTM will likely be sparse, since most documents don't use most of the terms present across the corpus.
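As a rough illustration of indexing and sparsity, here is a minimal sketch in base R. The matrix sparse_example, its terms, and its counts are made up for this example; it is not the sparse_review object from the slides.

```r
# Hypothetical counts: 4 reviews (rows) by 5 terms (columns).
# The values are invented for illustration only.
sparse_example <- matrix(
  c(0, 0, 1, 0, 0,
    2, 0, 0, 0, 0,
    0, 0, 0, 0, 0,
    0, 1, 0, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Docs = c("1", "2", "3", "4"),
    Terms = c("able", "about", "absolutely", "actually", "again")
  )
)

# Index by rows first, then columns
sparse_example[2, "able"]   # count of "able" in the second review
sparse_example[1:2, 1:3]    # first two reviews, first three terms

# Sparsity: the share of entries that are zero
sparsity <- mean(sparse_example == 0)
sparsity
```

Here 16 of the 20 entries are zero, so the sparsity works out to 0.8.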

3. Using cast_dtm()

To create a DTM, let's take our tidy_review data and count each word in each review, indicated here by the id column. Now we can use the cast_dtm() function from the tidytext package to cast our tidy data frame into a DTM. To use cast_dtm(), we need to specify the document column, the term column, and the word counts, in that order. The printed output isn't very detailed, though. It tells us that the object is indeed a DTM, with 1,791 documents or reviews and 9,669 terms, or unique words. It also reports sparsity, or what share of the entries in this matrix are zero. As we might expect, the DTM is very sparse.
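The steps described here might look something like the following sketch. The tiny tidy_review tibble is a made-up stand-in for the course data, so the dimensions it prints will not match the 1,791 documents and 9,669 terms mentioned above. The code assumes the dplyr, tidytext, and tm packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy data: one row per word per review (not the course data)
tidy_review <- tibble(
  id   = c(1, 1, 1, 2, 2, 3),
  word = c("great", "vacuum", "great", "vacuum", "broke", "love")
)

dtm_review <- tidy_review %>%
  count(word, id) %>%        # word counts per review, stored in column n
  cast_dtm(id, word, n)      # document column, term column, counts

# Printing shows the number of documents and terms, plus sparsity
dtm_review
```

With this toy input, the result has three documents and four terms.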

4. Using as.matrix()

To explore this DTM, we can apply the as.matrix() function after casting our data into a DTM. We know this is a very large, very sparse matrix, so we index dtm_review using square brackets, giving a range of rows and a range of columns with a colon. Here we are looking at the first four documents and the 2000th to 2004th terms. As an indication of sparsity, this subset of the DTM is composed entirely of zeros.
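A small sketch of this kind of exploration follows. The toy_review data and the dtm_toy name are hypothetical, and the index ranges are much smaller than the 2000:2004 range used in the slides because the toy matrix has only four terms. It again assumes dplyr, tidytext, and tm are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy data (not the course data)
toy_review <- tibble(
  id   = c(1, 1, 2, 3, 3, 4),
  word = c("quiet", "strong", "broke", "quiet", "love", "strong")
)

dtm_toy <- toy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n) %>%
  as.matrix()                # convert the DTM to an ordinary matrix

# Rows first, then columns; the colon gives a range
dtm_toy[1:2, 1:3]

# Share of entries that are zero
mean(dtm_toy == 0)
```

On a real DTM, the same pattern with larger ranges, such as rows 1:4 and columns 2000:2004, gives a quick look at a slice without printing the whole matrix.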

5. Let's practice!

A document term matrix is a common way to organize text. Let's practice casting our tidy data into a DTM and exploring its sparsity!