Capturing metadata in tm

Depending on what you are trying to accomplish, you may want to keep metadata about the document when you create a corpus.

To capture document-level metadata, the data frame's columns must be named and ordered as follows:

doc_id - a unique string for each document
text - the text to be examined
... - any other columns will be automatically cataloged as metadata.
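
As a minimal sketch of this layout, the snippet below builds a small data frame and turns it into a corpus. The data frame `docs_demo` and its `author` and `date` columns are hypothetical stand-ins invented for illustration; only `DataframeSource()`, `VCorpus()`, and `meta()` come from tm.

```r
library(tm)

# Hypothetical toy data: doc_id first, text second, metadata columns after
docs_demo <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text   = c("Coffee is great.", "Tea is fine too."),
  author = c("ann", "bob"),
  date   = c("2020-01-01", "2020-01-02"),
  stringsAsFactors = FALSE
)

# Columns after doc_id and text should be carried along as metadata
demo_corpus <- VCorpus(DataframeSource(docs_demo))

# Inspect which metadata columns were captured
names(meta(demo_corpus))
```

If the first two columns are named or ordered differently, `DataframeSource()` is expected to reject the data frame, which is why renaming is sometimes needed.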

Sometimes you will need to rename columns in order to fit the expectations of DataframeSource(). The names() function is helpful for this.
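
A small illustration of renaming with `names()`, using base R only. The data frame `tweets_demo` and its contents are made up here as a stand-in for the `tweets` data frame described in this exercise; the column names follow the ones listed below.

```r
# Hypothetical stand-in for tweets, with the same column names
tweets_demo <- data.frame(
  num        = 1:2,
  text       = c("First example tweet", "Second example tweet"),
  screenName = c("user_a", "user_b"),
  created    = c("2020-01-01 10:00:00", "2020-01-01 11:00:00"),
  stringsAsFactors = FALSE
)

# names() returns the column names; assigning to position 1 renames only that column
names(tweets_demo)[1] <- "doc_id"

names(tweets_demo)
```

Indexing into `names()` by position leaves the other column names untouched, so there is no need to retype them.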

tweets exists in your workspace as a data frame with columns "num", "text", "screenName", and "created".

Rename the first column of tweets to "doc_id".
Set the document schema with DataframeSource() on the smaller tweets data frame.
Make the document collection a volatile corpus with VCorpus(), nested inside the custom clean_corpus() function.
Apply content() to the first tweet with double brackets such as text_corpus[[1]] to see the cleaned plain text.
Confirm that all metadata was captured using the meta() function on the first document with single brackets.

Remember, when accessing part of a corpus, the double or single brackets make a difference! For this exercise, you will use double brackets with content() and single brackets with meta().
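
The steps above might look roughly like the following sketch. The exercise does not show the body of `clean_corpus()`, so the version here is an assumed, simplified definition, and `tweets_demo` is a made-up stand-in for `tweets`; treat both as illustrations, not as the course's actual code.

```r
library(tm)

# Assumed stand-in for the custom clean_corpus() function
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Hypothetical stand-in for the tweets data frame
tweets_demo <- data.frame(
  num        = c("1", "2"),
  text       = c("Loving my COFFEE this morning!", "Tea, anyone?"),
  screenName = c("user_a", "user_b"),
  created    = c("2020-01-01 10:00:00", "2020-01-01 11:00:00"),
  stringsAsFactors = FALSE
)

# Rename the first column, set the schema, then build and clean the corpus
names(tweets_demo)[1] <- "doc_id"
tweets_source <- DataframeSource(tweets_demo)
text_corpus <- clean_corpus(VCorpus(tweets_source))

# Double brackets extract a single document; content() returns its text
content(text_corpus[[1]])

# Single brackets keep a one-document corpus; meta() returns its metadata
meta(text_corpus[1])
```

Double brackets pull out one document object, while single brackets return a smaller corpus that still carries the document-level metadata, which is why `meta()` is paired with single brackets here.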
