Capturing metadata in tm
Depending on what you are trying to accomplish, you may want to keep metadata about the document when you create a corpus.
To capture document-level metadata, the column names and order must be:
- doc_id: a unique string for each document
- text: the text to be examined
- ...: any other columns will be automatically cataloged as metadata.
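As an illustration of this layout, here is a minimal sketch using made-up data. The name example_df and its contents are hypothetical and are not part of the exercise.

```r
# Hypothetical data: two short documents in the required column layout
example_df <- data.frame(
  doc_id = c("1", "2"),                          # unique string per document
  text = c("Coffee is great!", "Tea is fine."),  # the text to be examined
  screenName = c("user_a", "user_b"),            # extra column, kept as metadata
  stringsAsFactors = FALSE
)

# The first two column names are the ones the layout requires
names(example_df)[1:2]
```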
Sometimes you will need to rename columns to fit the expectations of DataframeSource(). The names() function is helpful for this.
tweets exists in your workspace as a data frame with columns "num", "text", "screenName", and "created".
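A renaming step with names() might look like the sketch below. The rows are invented stand-ins; only the column names come from the description above.

```r
# Stand-in for the tweets data frame (made-up rows, same column names)
tweets <- data.frame(
  num = 1:2,
  text = c("Loving my morning coffee", "Is tea better than coffee?"),
  screenName = c("user_a", "user_b"),
  created = c("2013-08-09 10:00:00", "2013-08-09 11:30:00"),
  stringsAsFactors = FALSE
)

# names() returns the column names; assigning to element 1 renames column 1
names(tweets)[1] <- "doc_id"
names(tweets)
```

Only the first name changes; the remaining columns keep their names and order.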
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Rename the first column of tweets to "doc_id".
- Set the document schema with DataframeSource() on the smaller tweets data frame.
- Make the document collection a volatile corpus nested in the custom clean_corpus() function.
- Apply content() to the first tweet with double brackets, such as text_corpus[[1]], to see the cleaned plain text.
- Confirm that all metadata was captured using the meta() function on the first document with single brackets.
Remember, when accessing part of a corpus, the double or single brackets make a difference! For this exercise, you will use double brackets with content() and single brackets with meta().
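The bracket distinction can be sketched as follows, assuming the tm package is installed. The data frame df and the names corpus, doc, and sub are hypothetical and chosen only for illustration.

```r
library(tm)

# Hypothetical two-document data frame in the doc_id/text layout
df <- data.frame(
  doc_id = c("1", "2"),
  text = c("First tweet text", "Second tweet text"),
  screenName = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(df))

# Double brackets extract a single document from the corpus
doc <- corpus[[1]]
class(doc)
content(doc)

# Single brackets return a corpus of length one, which still carries
# the document-level metadata built from the extra columns
sub <- corpus[1]
class(sub)
meta(sub)
```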
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Rename columns
___(tweets)[1] <- "___"
# Set the schema: docs
docs <- ___(___)
# Make a clean volatile corpus: text_corpus
text_corpus <- clean_corpus(___(___))
# Examine the first doc content
___(text_corpus[[___]])
# Access the first doc metadata
___(text_corpus[___])
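For reference, one way the blanks could be filled in is sketched below, assuming the tm package is installed. The tweets rows are invented, and the clean_corpus() shown here is a hypothetical stand-in, since the course's own definition is not given on this page.

```r
library(tm)

# Stand-in for the course's tweets data frame (made-up rows)
tweets <- data.frame(
  num = c("1", "2"),
  text = c("Loving my morning COFFEE!!", "Is tea better than coffee?"),
  screenName = c("user_a", "user_b"),
  created = c("2013-08-09 10:00:00", "2013-08-09 11:30:00"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the custom clean_corpus() function
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Rename columns
names(tweets)[1] <- "doc_id"
# Set the schema: docs
docs <- DataframeSource(tweets)
# Make a clean volatile corpus: text_corpus
text_corpus <- clean_corpus(VCorpus(docs))
# Examine the first doc content
content(text_corpus[[1]])
# Access the first doc metadata
meta(text_corpus[1])
```

With this stand-in, meta() should list the screenName and created columns for the first document, since every column other than doc_id and text is treated as metadata.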