Make a VCorpus from a data frame
If your text data is in a data frame, you can use DataframeSource()
for your analysis. The data frame passed to DataframeSource()
must have a specific structure:
- Column one must be called doc_id and contain a unique string for each row.
- Column two must be called text with "UTF-8" encoding (pretty standard).
- Any other columns, 3+, are considered metadata and will be retained as such.
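The layout described above can be sketched with base R alone. This is only an illustration: the data frame my_docs and its author and year columns are made-up names, not the course's example_text.

```r
# Hypothetical example data; my_docs, author, and year are illustrative names
my_docs <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text = c(
    "Text mining is fun.",
    "R has a package called tm.",
    "Metadata travels with each document."
  ),
  author = c("Ana", "Ben", "Cy"),
  year = c(2016L, 2017L, 2018L),
  stringsAsFactors = FALSE
)

# Check the layout: fixed names for columns one and two,
# unique ids, and character text that is valid UTF-8
stopifnot(
  identical(names(my_docs)[1:2], c("doc_id", "text")),
  anyDuplicated(my_docs$doc_id) == 0,
  is.character(my_docs$text),
  all(validUTF8(my_docs$text))
)

# Columns 3 and up are the ones that would be kept as metadata
metadata_cols <- names(my_docs)[-(1:2)]
metadata_cols
```

Running checks like these before building a source can make a misnamed or misordered column easier to spot.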
This exercise introduces meta() to extract the metadata associated with each document. Often your data will have metadata such as authors, dates, topic tags, or places that can inform your analysis. Once your text is a corpus, you can apply meta() to examine the additional document level information.
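As a rough sketch of what meta() returns, the snippet below builds a tiny corpus from a made-up data frame. It assumes the tm package is installed; docs_df and its topic column are hypothetical names chosen for illustration.

```r
library(tm)

# Hypothetical stand-in data, not the course's example_text
docs_df <- data.frame(
  doc_id = c("a", "b"),
  text = c("First short document.", "Second short document."),
  topic = c("intro", "follow-up"),
  stringsAsFactors = FALSE
)

docs_corpus <- VCorpus(DataframeSource(docs_df))

# Document-level metadata: one row per document,
# one column per extra column of the data frame
doc_meta <- meta(docs_corpus)
doc_meta

# A single metadata field can be selected by name
meta(docs_corpus, tag = "topic")
```

Because the result is an ordinary data frame, the usual subsetting and summary tools should apply to it.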
This is a part of the course “Text Mining with Bag-of-Words in R”
Exercise instructions
In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, which is a volatile corpus made with VectorSource().
- Create df_source using DataframeSource() with the example_text.
- Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
- Print out df_corpus. Notice how many documents it contains and the number of retained document-level metadata points.
- Use meta() on df_corpus to print the document associated metadata.
- Examine the pre-loaded vec_corpus object. Compare the number of documents to df_corpus.
- Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a DataframeSource from the example text
df_source <- ___
# Convert df_source to a volatile corpus
df_corpus <- ___
# Examine df_corpus
df_corpus
# Examine df_corpus metadata
___
# Compare the number of documents in the vector source
vec_corpus
# Compare metadata in the vector corpus
___
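One possible way to fill in the blanks is sketched below. It assumes the tm package is installed. Because example_text and vec_corpus exist only in the course workspace, the sketch creates stand-ins when they are missing; their contents here are invented for illustration and will differ from the real objects.

```r
library(tm)

# Stand-ins for the pre-loaded workspace objects (assumed contents)
if (!exists("example_text")) {
  example_text <- data.frame(
    doc_id = c("1", "2", "3"),
    text = c(
      "Text mining can be a good time.",
      "Text analysis can provide insights.",
      "The tm package supports text mining in R."
    ),
    author = c("Author1", "Author2", "Author3"),
    date = c("2020-01-01", "2020-01-02", "2020-01-03"),
    stringsAsFactors = FALSE
  )
}
if (!exists("vec_corpus")) {
  vec_corpus <- VCorpus(VectorSource(example_text$text))
}

# Create a DataframeSource from the example text
df_source <- DataframeSource(example_text)

# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)

# Examine df_corpus
df_corpus

# Examine df_corpus metadata
meta(df_corpus)

# Compare the number of documents in the vector source
vec_corpus

# Compare metadata in the vector corpus
meta(vec_corpus)
```

With these stand-ins, the corpus built from the data frame keeps the extra columns as document-level metadata, while the corpus built from a plain character vector has no such columns to show.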