Capturing metadata in tm

Depending on what you are trying to accomplish, you may want to keep metadata about the document when you create a corpus.

To capture document-level metadata, the column names and order must be:

  1. doc_id - a unique string for each document
  2. text - the text to be examined
  3. ... - any other columns will be automatically cataloged as metadata.

Sometimes you will need to rename columns in order to fit the expectations of DataframeSource(). The names() function is helpful for this.

tweets exists in your workspace as a data frame with columns "num", "text", "screenName", and "created".
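
The column requirements above can be sketched on a small example. This is a minimal sketch, not the exercise solution: the data frame below is a hypothetical stand-in for tweets with invented values, and the cast to character is an assumption made so that doc_id matches the "unique string" requirement.

```r
library(tm)

# Hypothetical stand-in for the tweets data frame; the values are invented
tweets <- data.frame(
  num = 1:2,
  text = c("Coffee is GREAT!", "Tea is fine, too."),
  screenName = c("user_a", "user_b"),
  created = c("2017-01-01", "2017-01-02"),
  stringsAsFactors = FALSE
)

# names() can be assigned to, so a single column is renamed in place
names(tweets)[1] <- "doc_id"

# Assumption: cast to character so doc_id is a unique string per document
tweets$doc_id <- as.character(tweets$doc_id)

# doc_id and text are consumed by the source; the remaining columns
# (screenName, created) are carried along as metadata
docs <- DataframeSource(tweets)
corpus <- VCorpus(docs)

# Corpus-wide view of the captured metadata columns
meta(corpus)
```

If a column other than the first needs renaming, the same pattern applies with a different index, for example names(tweets)[3].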

This exercise is part of the course

Text Mining with Bag-of-Words in R


Exercise instructions

  • Rename the first column of tweets to "doc_id".
  • Set the document schema with DataframeSource() on the smaller tweets data frame.
  • Create a volatile corpus from docs, nesting the call inside the custom clean_corpus() function so the result is cleaned.
  • Apply content() to the first tweet with double brackets such as text_corpus[[1]] to see the cleaned plain text.
  • Confirm that all metadata was captured using the meta() function on the first document with single brackets.

Remember, when accessing part of a corpus, the double or single brackets make a difference! For this exercise, you will use double brackets with content() and single brackets with meta().
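
The bracket distinction can be illustrated with a short sketch. The data frame and the clean_corpus() function below are hypothetical stand-ins (the course's own clean_corpus() is not shown here and may apply different steps); only the tm functions themselves are real.

```r
library(tm)

# Hypothetical data, already in the doc_id / text / metadata layout
tweets <- data.frame(
  doc_id = c("1", "2"),
  text = c("Coffee is GREAT!", "Tea is fine, too."),
  screenName = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the course's clean_corpus() helper
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus
}

# Build the volatile corpus and clean it in one nested call
text_corpus <- clean_corpus(VCorpus(DataframeSource(tweets)))

# Double brackets extract a single document, so content() returns its text
first_text <- content(text_corpus[[1]])

# Single brackets return a one-document corpus, so meta() returns the
# matching row of the metadata captured from the data frame
first_meta <- meta(text_corpus[1])

print(first_text)
print(first_meta)
```

Swapping the brackets changes what comes back: text_corpus[[1]] is one document, while text_corpus[1] is still a corpus and keeps the data frame columns attached.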

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Rename columns
___(tweets)[1] <- "___"

# Set the schema: docs
docs <- ___(___)

# Make a clean volatile corpus: text_corpus
text_corpus <- clean_corpus(___(___))

# Examine the first doc content
___(text_corpus[[___]])

# Access the first doc metadata
___(text_corpus[___])