Make the vector a VCorpus object (2)
Now that we've converted our vector to a Source object, we pass it to another tm function, VCorpus(), to create our volatile corpus. Pretty straightforward, right?
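As a rough sketch of that step, the snippet below builds a small volatile corpus from scratch. The vector example_text and the names example_source and example_corpus are made up for illustration and are not the course's coffee data; it assumes the tm package is installed.

```r
library(tm)

# Hypothetical example texts (stand-ins for the course's tweets)
example_text <- c(
  "Text mining is fun.",
  "A corpus is a collection of documents.",
  "Each document carries content and metadata."
)

# Wrap the character vector in a Source object
example_source <- VectorSource(example_text)

# Build a volatile (in-memory) corpus from the source
example_corpus <- VCorpus(example_source)

# Printing the corpus shows a short summary rather than the full text
example_corpus
```

Each element of the vector becomes one document, so the corpus here should hold three documents.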
The VCorpus object is a nested list, or list of lists. At each index of the VCorpus object, there is a PlainTextDocument object, which is a list containing the actual text data (content) and some corresponding metadata (meta). It can help to visualize a VCorpus object to conceptualize the whole thing.
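One way to get a feel for this nesting is to inspect a tiny corpus directly. The sketch below uses a hypothetical two-document corpus (tiny_corpus is an illustrative name, not part of the exercise) and assumes the tm package is installed.

```r
library(tm)

# Hypothetical two-document corpus for illustration
tiny_corpus <- VCorpus(VectorSource(c("first document", "second document")))

# Pull out one document with double brackets
tiny_doc <- tiny_corpus[[1]]

# The document is a list with two named elements
names(tiny_doc)

# Its class marks it as a PlainTextDocument
class(tiny_doc)

# str() prints the nested layout of the content and meta elements
str(tiny_doc)
```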
To review a single document object (the 10th), you subset with double square brackets.
coffee_corpus[[10]]
To review the actual text, you index the list twice. To access the document's metadata, like timestamp, change [1] to [2]. Another way to review the plain text is with the content() function, which doesn't need the second set of brackets.
coffee_corpus[[10]][1]
content(coffee_corpus[[10]])
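The same access patterns can be tried on a small stand-in corpus. This is only a sketch: demo_corpus and its texts are hypothetical, and the tm package is assumed to be installed.

```r
library(tm)

# Hypothetical corpus standing in for coffee_corpus
demo_corpus <- VCorpus(VectorSource(c("first doc", "second doc", "third doc")))

# Double brackets select one document (here the 2nd)
demo_doc <- demo_corpus[[2]]

# Single brackets after that return a one-element list holding the text
demo_doc[1]

# Changing [1] to [2] returns the metadata instead
demo_doc[2]

# content() returns the text itself as a character vector
demo_text <- content(demo_doc)
demo_text

# meta() is the matching accessor for the document's metadata
meta(demo_doc)
```

Note that the single-bracket form keeps the list wrapper, while content() hands back the bare character vector, which is usually easier to pass on to other functions.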
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Call the VCorpus() function on the coffee_source object to create coffee_corpus.
- Verify coffee_corpus is a VCorpus object by printing it to the console.
- Print the 15th element of coffee_corpus to the console to verify that it's a PlainTextDocument that contains the content and metadata of the 15th tweet. Use double bracket subsetting.
- Print the content of the 15th tweet in coffee_corpus. Use double brackets to select the proper tweet, followed by single brackets to extract the content of that tweet.
- Print the content() of the 10th tweet within coffee_corpus.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
## coffee_source is already in your workspace
# Make a volatile corpus from coffee_source
coffee_corpus <- ___
# Print out coffee_corpus
___
# Print the 15th tweet in coffee_corpus
___
# Print the contents of the 15th tweet in coffee_corpus
___
# Now use content to review the plain text of the 10th tweet
___(___[[___]])