Session Ready
Exercise

TM refresher (II)

Now let's create a Document Term Matrix (DTM). In a DTM:

  • Each row of the matrix represents a document.
  • Each column is a unique word token.
  • Values of the matrix correspond to an individual document's word usage.

The DTM is the basis for many bag of words analyses. Later in the course, you will also use the related Term Document Matrix (TDM). This is the transpose; that is, columns represent documents and rows represent unique word tokens.

You should construct a DTM after cleaning the corpus (using clean_corpus()). To do so, call DocumentTermMatrix() on the corpus object.

tm_dtm <- DocumentTermMatrix(tm_clean)

If you need a more in-depth refresher check out the Text Mining: Bag of Words course. Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!

Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).

Instructions
100 XP

We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee. The tweets have been cleaned with the previously mentioned preprocessing steps and your goal is to create a DTM from it.

  • Apply DocumentTermMatrix() to the clean_text corpus to create a term frequency weighted DTM called tf_dtm .
  • Change the DocumentTermMatrix() object into a simple matrix with as.matrix(). Call the new object tf_dtm_m.
  • Check the dimensions of the matrix using dim().
  • Use square bracket indexing to see a subset of the matrix.
  • Select rows 16 to 20, and columns 2975 to 2985
  • Note the frequency value of the word "working."