
Make a VCorpus from a data frame

If your text data is in a data frame, you can use DataframeSource() to read it into a corpus. The data frame passed to DataframeSource() must have a specific structure:

  • Column one must be called doc_id and contain a unique string for each row.
  • Column two must be called text and contain the document text, encoded as UTF-8 (the usual default).
  • Any further columns (three onward) are treated as document-level metadata and retained as such.
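
As a minimal sketch of the structure described above, the snippet below builds a small data frame and turns it into a volatile corpus. The name docs_df, the author column, and the sample sentences are hypothetical and chosen only for illustration; the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical data frame: doc_id first, text second, metadata afterwards
docs_df <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text   = c("Text mining is fun.",
             "A corpus holds documents.",
             "Metadata travels with the text."),
  author = c("Ana", "Ben", "Cy"),   # illustrative metadata column
  stringsAsFactors = FALSE
)

# Wrap the data frame as a source, then build a volatile corpus from it
docs_source <- DataframeSource(docs_df)
docs_corpus <- VCorpus(docs_source)

# One document per row of the data frame
length(docs_corpus)
```

If the first two columns are not named doc_id and text, DataframeSource() is expected to reject the data frame, so renaming columns beforehand is usually the first step.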

This exercise introduces meta(), which extracts the metadata associated with each document. Your data will often include metadata such as authors, dates, topic tags, or places that can inform your analysis. Once your text is a corpus, you can apply meta() to examine this additional document-level information.
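
A short sketch of how meta() might be used on a corpus built from a data frame follows. The data frame notes_df and its author and topic columns are hypothetical, invented for this illustration; the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical data frame with two metadata columns (author, topic)
notes_df <- data.frame(
  doc_id = c("n1", "n2"),
  text   = c("Coffee prices rose.", "Tea sales fell."),
  author = c("Ana", "Ben"),
  topic  = c("coffee", "tea"),
  stringsAsFactors = FALSE
)

notes_corpus <- VCorpus(DataframeSource(notes_df))

# Document-level metadata for the whole corpus, returned as a data frame
notes_meta <- meta(notes_corpus)
notes_meta

# A single metadata column can be pulled out like any data frame column
notes_meta$author
```

Because the result is an ordinary data frame, it can be filtered or joined like any other, for example to select documents by author before further processing.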

This is a part of the course “Text Mining with Bag-of-Words in R”.

Exercise instructions

In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, a volatile corpus made with VectorSource().

  • Create df_source using DataframeSource() with the example_text.
  • Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
  • Print out df_corpus. Notice how many documents it contains and the number of retained document-level metadata points.
  • Use meta() on df_corpus to print the document-level metadata.
  • Examine the pre-loaded vec_corpus object. Compare the number of documents to df_corpus.
  • Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.
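
To illustrate why the two corpora differ, here is a hedged sketch with made-up data (not the exercise's example_text or vec_corpus). The names toy_text, toy_df, and the place column are hypothetical; the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical texts used for both corpora
toy_text <- c("First short document.", "Second short document.")

# Corpus from a plain character vector: there are no extra columns to carry along
toy_vec_corpus <- VCorpus(VectorSource(toy_text))

# Corpus from a data frame: the extra 'place' column becomes metadata
toy_df <- data.frame(
  doc_id = c("a", "b"),
  text   = toy_text,
  place  = c("Lisbon", "Oslo"),   # illustrative metadata column
  stringsAsFactors = FALSE
)
toy_df_corpus <- VCorpus(DataframeSource(toy_df))

# Same number of documents, different amount of document-level metadata
length(toy_vec_corpus) == length(toy_df_corpus)
ncol(meta(toy_vec_corpus))
ncol(meta(toy_df_corpus))
```

The document counts match because both corpora hold the same texts; only the data frame route has columns available to retain as metadata.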

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a DataframeSource from the example text
df_source <- ___

# Convert df_source to a volatile corpus
df_corpus <- ___

# Examine df_corpus
df_corpus

# Examine df_corpus metadata
___

# Compare the number of documents in the vector corpus
vec_corpus

# Compare metadata in the vector corpus
___