
TM refresher (II)

Now let's create a Document Term Matrix (DTM). In a DTM:

  • Each row of the matrix represents a document.
  • Each column is a unique word token.
  • Values of the matrix correspond to an individual document's word usage.

The DTM is the basis for many bag-of-words analyses. Later in the course, you will also use the related Term Document Matrix (TDM). The TDM is the transpose of the DTM; that is, its columns represent documents and its rows represent unique word tokens.
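To make the row and column layout concrete, here is a minimal sketch on three made-up documents. The object names (toy_docs, toy_corpus, toy_dtm, toy_tdm) and the text are illustrative assumptions, not course data. Every word has at least three characters because tm drops shorter tokens by default.

```r
library(tm)

# Three made-up documents (illustrative only, not the course's tweets)
toy_docs <- c("coffee tastes good", "good good coffee", "tea tastes fine")
toy_corpus <- VCorpus(VectorSource(toy_docs))

# DTM: one row per document, one column per unique term
toy_dtm <- DocumentTermMatrix(toy_corpus)
dtm_m <- as.matrix(toy_dtm)
dim(dtm_m)

# TDM: one row per unique term, one column per document
toy_tdm <- TermDocumentMatrix(toy_corpus)
tdm_m <- as.matrix(toy_tdm)
dim(tdm_m)

# The two matrices hold the same counts, transposed
all(dtm_m == t(tdm_m))
```

Printing dtm_m should show that the second document uses "good" twice, which is the kind of per-document word usage the matrix values record.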

You should construct a DTM after cleaning the corpus (using clean_corpus()). To do so, call DocumentTermMatrix() on the corpus object.

tm_dtm <- DocumentTermMatrix(tm_clean)
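The definition of clean_corpus() is not shown on this page, so the sketch below uses a hypothetical stand-in, clean_corpus_sketch(), built from standard tm transformations. The specific cleaning steps and the two sample sentences are assumptions for illustration; the course's own function may differ.

```r
library(tm)

# Hypothetical stand-in for the course's clean_corpus(); the real
# function may apply different or additional steps
clean_corpus_sketch <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up raw text (illustrative only)
tm_raw <- VCorpus(VectorSource(c("The coffee is GREAT!",
                                 "Great coffee, great day.")))

# Clean first, then build the DTM from the cleaned corpus
tm_clean <- clean_corpus_sketch(tm_raw)
tm_dtm <- DocumentTermMatrix(tm_clean)

# Terms that survived cleaning
Terms(tm_dtm)
```

Cleaning before building the matrix matters because case variants, punctuation, and stopwords would otherwise become separate or uninformative columns.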

If you need a more in-depth refresher, check out the Text Mining with Bag-of-Words in R course. Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!

Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).

This exercise is part of the course

Sentiment Analysis in R


Exercise instructions

We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee. The tweets have been cleaned with the previously mentioned preprocessing steps, and your goal is to create a DTM from the corpus.

  • Apply DocumentTermMatrix() to the clean_text corpus to create a term-frequency-weighted DTM called tf_dtm.
  • Change the DocumentTermMatrix() object into a simple matrix with as.matrix(). Call the new object tf_dtm_m.
  • Check the dimensions of the matrix using dim().
  • Use square bracket indexing to see a subset of the matrix.
      • Select rows 16 to 20 and columns 2975 to 2985.
      • Note the frequency value of the word "working".
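The same sequence of steps can be sketched on a tiny made-up corpus. The names toy_text, toy_corpus, toy_tf_dtm, and toy_m are illustrative assumptions, and the indices are far smaller than those in the exercise. Term frequency is the default weighting in tm, so no extra argument is needed; passing control = list(weighting = weightTf) states the same choice explicitly.

```r
library(tm)

# Made-up stand-in for clean_text (illustrative only)
toy_text <- c("coffee shop working late",
              "working working coffee",
              "morning coffee")
toy_corpus <- VCorpus(VectorSource(toy_text))

# Term frequency weighted DTM (the default weighting)
toy_tf_dtm <- DocumentTermMatrix(toy_corpus)

# Convert to a plain matrix so base R indexing applies
toy_m <- as.matrix(toy_tf_dtm)

# Number of documents, then number of unique terms
dim(toy_m)

# Rows 2 to 3, selecting columns by term name instead of position
toy_m[2:3, c("coffee", "working")]
```

In the toy data the second document contains "working" twice and the third does not contain it at all, so those are the counts to expect in that column. Numeric column ranges work the same way as names, but names are easier to read when the vocabulary is small.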

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# clean_text is pre-defined
clean_text

# Create tf_dtm
tf_dtm <- ___

# Create tf_dtm_m
tf_dtm_m <- ___

# Dimensions of DTM matrix
___

# Subset part of tf_dtm_m for comparison
___[___, ___]