TM refresher (II)
Now let's create a Document Term Matrix (DTM). In a DTM:
- Each row of the matrix represents a document.
- Each column is a unique word token.
- Each value records how often that term appears in that document; with the default term frequency weighting, it is a raw count.
The DTM is the basis for many bag-of-words analyses. Later in the course, you will also use the related Term Document Matrix (TDM). The TDM is the transpose of the DTM; that is, columns represent documents and rows represent unique word tokens.
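To make the DTM/TDM relationship concrete, here is a minimal sketch on a made-up three-document corpus. It assumes the tm package is installed; the names docs, toy_corpus, toy_dtm, and toy_tdm are illustrative and not part of the course.

```r
library(tm)

# Toy documents (invented for illustration only)
docs <- c("coffee is great", "great coffee great morning", "tea time")
toy_corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Rows are terms, columns are documents
toy_tdm <- TermDocumentMatrix(toy_corpus)

dim(toy_dtm)  # documents x terms
dim(toy_tdm)  # terms x documents

# The TDM holds the same counts as the transposed DTM
all(t(as.matrix(toy_dtm)) == as.matrix(toy_tdm))
```

Note that with tm's default settings, very short tokens such as "is" are dropped, so the number of columns may be smaller than the number of distinct words you see in the raw text.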
You should construct a DTM after cleaning the corpus (using clean_corpus()). To do so, call DocumentTermMatrix() on the corpus object.
tm_dtm <- DocumentTermMatrix(tm_clean)
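The clean-then-build order can be sketched as follows. The course's clean_corpus() is not shown here, so clean_corpus_sketch() below is a hypothetical stand-in built from standard tm transformations; the actual helper may apply different steps. The sample text is invented.

```r
library(tm)

# Hypothetical stand-in for the course's clean_corpus() helper (an assumption)
clean_corpus_sketch <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Invented example text, not real tweets
raw_docs <- c("I LOVE coffee!!", "Coffee, please... working late.")
toy_raw <- VCorpus(VectorSource(raw_docs))

# Clean first, then build the DTM from the cleaned corpus
toy_clean <- clean_corpus_sketch(toy_raw)
toy_clean_dtm <- DocumentTermMatrix(toy_clean)

# Inspect the terms that survived cleaning
Terms(toy_clean_dtm)
```

Cleaning before building the matrix matters because case variants and punctuation would otherwise produce separate columns for what is really the same word.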
If you need a more in-depth refresher, check out the Text Mining with Bag-of-Words in R course. Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
This exercise is part of the course
Sentiment Analysis in R
Exercise instructions
We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee. The tweets have been cleaned with the previously mentioned preprocessing steps, and your goal is to create a DTM from it.
- Apply DocumentTermMatrix() to the clean_text corpus to create a term frequency weighted DTM called tf_dtm.
- Change the DocumentTermMatrix() object into a simple matrix with as.matrix(). Call the new object tf_dtm_m.
- Check the dimensions of the matrix using dim().
- Use square bracket indexing to see a subset of the matrix.
  - Select rows 16 to 20, and columns 2975 to 2985.
  - Note the frequency value of the word "working."
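The same sequence of steps can be tried on a tiny invented corpus before running it on the tweets. This is only a sketch: the toy_* names and documents are assumptions for illustration, and the row and column ranges differ from the exercise because the toy matrix is much smaller.

```r
library(tm)

# Invented documents standing in for cleaned tweets
docs <- c("coffee working late", "coffee coffee break", "working working working")
toy_corpus <- VCorpus(VectorSource(docs))

# Term frequency weighting (weightTf) is the default for DocumentTermMatrix()
toy_tf_dtm <- DocumentTermMatrix(toy_corpus)

# Convert the sparse DTM object to an ordinary matrix
toy_m <- as.matrix(toy_tf_dtm)

# Number of documents (rows) and unique terms (columns)
dim(toy_m)

# Square bracket indexing: rows 2 to 3, selected columns by name
toy_m[2:3, c("coffee", "working")]
```

Columns can be selected by position, as the exercise asks, or by term name as shown here; name-based indexing is convenient when you want to look up a specific word such as "working".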
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# clean_text is pre-defined
clean_text
# Create tf_dtm
tf_dtm <- ___
# Create tf_dtm_m
tf_dtm_m <- ___
# Dimensions of DTM matrix
___
# Subset part of tf_dtm_m for comparison
___[___, ___]