Make a document-term matrix
Hopefully, you are not too tired after all this basic text mining work! Just in case, let's revisit the coffee and get some Starbucks while building a document-term matrix from coffee tweets.
Beginning with the coffee.csv file, we have used common transformations to produce a clean corpus called clean_corp.
The document-term matrix (DTM) is used when you want each document represented as a row. This can be useful if you are comparing authors within rows, or if the data is arranged chronologically and you want to preserve the time series. The tm package stores the DTM in a sparse "simple triplet matrix" class. However, it is often easier to manipulate and examine the object by re-classifying the DTM with as.matrix().
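As a rough illustration of the paragraph above, the sketch below builds a DTM from a tiny made-up corpus and converts it to an ordinary matrix. The names toy_text, toy_corp, toy_dtm, and toy_m and the three short "tweets" are assumptions for demonstration only, not the course's coffee data, and the example assumes the tm package is installed.

```r
library(tm)

# Hypothetical stand-in for cleaned tweets (not the course's coffee.csv data)
toy_text <- c("starbucks coffee morning",
              "coffee shop star barista",
              "morning coffee starbucks starbucks")

# Build a small corpus from the character vector
toy_corp <- VCorpus(VectorSource(toy_text))

# One row per document, one column per term
toy_dtm <- DocumentTermMatrix(toy_corp)

# The class vector includes "simple_triplet_matrix", the sparse storage format
class(toy_dtm)

# Re-classify as a dense base R matrix for easier inspection
toy_m <- as.matrix(toy_dtm)
toy_m
```

Printing toy_dtm shows a summary of the sparse object, while printing toy_m shows the counts themselves, which is usually what you want when exploring a small slice of the data.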
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Create coffee_dtm by applying DocumentTermMatrix() to clean_corp.
- Create coffee_m, a matrix version of coffee_dtm, using as.matrix().
- Print the dimensions of coffee_m to the console using the dim() function. Note the number of rows and columns.
- Print the subset of coffee_m containing documents (rows) 25 through 35 and terms (columns) "star" and "starbucks".
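The last two steps rely only on base R matrix tools, so they can be tried on any matrix with named columns. The sketch below uses a hypothetical toy_m filled with random counts (40 documents, 3 terms) purely to show how dim() and row-range plus column-name subsetting behave; it is not the coffee data.

```r
# Hypothetical count matrix: 40 documents (rows) by 3 terms (columns)
set.seed(1)
toy_m <- matrix(rpois(120, lambda = 0.3), nrow = 40,
                dimnames = list(Docs = as.character(1:40),
                                Terms = c("coffee", "star", "starbucks")))

# dim() returns the number of rows first, then the number of columns
dim(toy_m)

# Rows are selected by position, columns by name
toy_sub <- toy_m[25:35, c("star", "starbucks")]
toy_sub
```

Note that 25:35 is inclusive at both ends, so the subset has 11 rows.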
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create the document-term matrix from the corpus
coffee_dtm <- ___
# Print out coffee_dtm data
coffee_dtm
# Convert coffee_dtm to a matrix
coffee_m <- ___
# Print the dimensions of coffee_m
___
# Review a portion of the matrix to get some Starbucks
___[___:___, c("star", "___")]