
Make a VCorpus from a data frame

If your text data is in a data frame, you can use DataframeSource() for your analysis. The data frame passed to DataframeSource() must have a specific structure:

  • Column one must be called doc_id and contain a unique string for each row.
  • Column two must be called text and contain each document's content as a "UTF-8" encoded string (the usual default).
  • Any other columns, 3+, are considered metadata and will be retained as such.
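As a minimal sketch of the structure described above, here is a small data frame built in base R. The values and the metadata column names (author, date) are made up for illustration; only doc_id and text are required names.

```r
# Hypothetical example data: only the names doc_id and text are required,
# the author and date columns are illustrative metadata.
example_df <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text = c(
    "Text mining is a great time.",
    "Text analysis provides insights.",
    "qdap and tm are used in text mining."
  ),
  author = c("author_a", "author_b", "author_c"),
  date = c("2018-01-01", "2018-01-02", "2018-01-03"),
  stringsAsFactors = FALSE
)

# Check the structure before handing the data frame to DataframeSource()
stopifnot(
  identical(names(example_df)[1:2], c("doc_id", "text")),  # columns one and two
  anyDuplicated(example_df$doc_id) == 0,                   # unique ids
  all(validUTF8(example_df$text))                          # UTF-8 text
)
```

Checking the structure up front gives a clearer error than a failure inside the corpus constructor.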

This exercise introduces meta(), which extracts the metadata associated with each document. Your data will often include metadata such as authors, dates, topic tags, or places that can inform your analysis. Once your text is a corpus, you can apply meta() to examine this additional document-level information.
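A short sketch of what meta() returns for a corpus built from a data frame, assuming the tm package is installed. The data frame tiny_df and its author and date columns are hypothetical stand-ins, not the course data.

```r
library(tm)

# Hypothetical stand-in data: two required columns plus two metadata columns
tiny_df <- data.frame(
  doc_id = c("a", "b"),
  text = c("First short document.", "Second short document."),
  author = c("author_a", "author_b"),
  date = c("2018-01-01", "2018-01-02"),
  stringsAsFactors = FALSE
)

tiny_corpus <- VCorpus(DataframeSource(tiny_df))

# Document-level metadata comes back as a data frame, one row per document
doc_meta <- meta(tiny_corpus)
doc_meta

# A single metadata field can then be pulled out like any data frame column
doc_meta$author
```

Because the result is an ordinary data frame, it can be filtered or joined like any other table.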

This is a part of the course

“Text Mining with Bag-of-Words in R”


Exercise instructions

In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, a volatile corpus made with VectorSource().

  • Create df_source using DataframeSource() with the example_text.
  • Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
  • Print out df_corpus. Notice how many documents it contains and the number of retained document-level metadata points.
  • Use meta() on df_corpus to print the document-level metadata.
  • Examine the pre-loaded vec_corpus object. Compare the number of documents to df_corpus.
  • Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a DataframeSource from the example text
df_source <- ___

# Convert df_source to a volatile corpus
df_corpus <- ___

# Examine df_corpus
df_corpus

# Examine df_corpus metadata
___

# Compare the number of documents in the vector source
vec_corpus

# Compare metadata in the vector corpus
___
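The comparison this exercise asks for can be sketched outside the course workspace with stand-in data, assuming the tm package is installed. The objects demo_text, demo_df_corpus, and demo_vec_corpus are hypothetical names used here so they do not clash with the pre-loaded example_text and vec_corpus.

```r
library(tm)

# Hypothetical stand-in for the course's example_text
demo_text <- data.frame(
  doc_id = c("1", "2", "3"),
  text = c("One short text.", "Another short text.", "A third short text."),
  author = c("author_a", "author_b", "author_c"),
  stringsAsFactors = FALSE
)

# Same texts, two different sources
demo_df_corpus <- VCorpus(DataframeSource(demo_text))
demo_vec_corpus <- VCorpus(VectorSource(demo_text$text))

# Both corpora hold the same number of documents
n_df_docs <- length(demo_df_corpus)
n_vec_docs <- length(demo_vec_corpus)

# Only the data frame corpus carries the extra columns as metadata
n_df_meta <- ncol(meta(demo_df_corpus))
n_vec_meta <- ncol(meta(demo_vec_corpus))

c(df_docs = n_df_docs, vec_docs = n_vec_docs,
  df_meta = n_df_meta, vec_meta = n_vec_meta)
```

A character vector has no place to carry extra columns, so the vector-based corpus keeps the texts but has no document-level metadata.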
