Make a VCorpus from a data frame
If your text data is in a data frame, you can use DataframeSource()
for your analysis. The data frame passed to DataframeSource()
must have a specific structure:
- Column one must be called doc_id and contain a unique string for each row.
- Column two must be called text with "UTF-8" encoding (pretty standard).
- Any other columns, 3+, are considered metadata and will be retained as such.
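As a rough illustration of that layout, the base R sketch below builds a small data frame and checks the three rules. The name toy_df and every value in it are made up for this example and are not part of the course data.

```r
# Hypothetical data: doc_id first, text second, metadata in columns 3+
toy_df <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text = c("Text mining is fun.",
           "A corpus holds documents.",
           "Metadata travels with the text."),
  author = c("Ana", "Ben", "Cy"),
  year = c(2019, 2020, 2021),
  stringsAsFactors = FALSE
)

# Convert the text column to UTF-8 (no change for plain ASCII strings)
toy_df$text <- enc2utf8(toy_df$text)

# Check the first two column names and that every doc_id is unique
stopifnot(
  identical(names(toy_df)[1:2], c("doc_id", "text")),
  is.character(toy_df$doc_id),
  anyDuplicated(toy_df$doc_id) == 0
)

# Whatever is left over would be treated as metadata
metadata_cols <- setdiff(names(toy_df), c("doc_id", "text"))
metadata_cols
```

Running a check like this before calling DataframeSource() can make structural problems easier to spot.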
This exercise introduces meta() to extract the metadata associated with each document. Often your data will have metadata such as authors, dates, topic tags, or places that can inform your analysis. Once your text is a corpus, you can apply meta() to examine the additional document-level information.
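A minimal sketch of how meta() might be used, assuming the tm package is installed. The data frame toy_df and its author and year columns are hypothetical stand-ins, not the example_text from the exercise.

```r
library(tm)

# Hypothetical data frame with the required doc_id and text columns
toy_df <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("First toy document.", "Second toy document."),
  author = c("Ana", "Ben"),
  year = c(2019, 2020),
  stringsAsFactors = FALSE
)

# Build a volatile corpus from the data frame
toy_corpus <- VCorpus(DataframeSource(toy_df))

# Document-level metadata comes back as a data frame, one row per document
toy_meta <- meta(toy_corpus)
toy_meta

# Individual metadata columns can be pulled out like any data frame column
toy_meta$author

# The text itself lives in the document content, not in the metadata
content(toy_corpus[[1]])
```

Because the metadata is an ordinary data frame, it can be filtered or joined with other tables using the usual tools.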
This is a part of the course
“Text Mining with Bag-of-Words in R”
Exercise instructions
In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, which is a volatile corpus made with VectorSource().
- Create df_source using DataframeSource() with example_text.
- Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
- Print out df_corpus. Notice how many documents it contains and the number of retained document-level metadata points.
- Use meta() on df_corpus to print the document-level metadata.
- Examine the pre-loaded vec_corpus object. Compare the number of documents to df_corpus.
- Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.
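To see why the comparison in the last two steps is worth making, here is a hedged sketch of a corpus built from a plain character vector, again assuming the tm package is installed. The names toy_text and toy_vec_corpus are invented for this example. A character vector carries no extra columns, so the document-level metadata is expected to come back without any columns.

```r
library(tm)

# Hypothetical character vector: text only, no accompanying columns
toy_text <- c("First toy document.", "Second toy document.")

# Build a volatile corpus straight from the vector
toy_vec_corpus <- VCorpus(VectorSource(toy_text))

# Number of documents in the corpus
length(toy_vec_corpus)

# Document-level metadata: still a data frame, but with nothing in it
toy_vec_meta <- meta(toy_vec_corpus)
ncol(toy_vec_meta)
```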
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a DataframeSource from the example text
df_source <- ___
# Convert df_source to a volatile corpus
df_corpus <- ___
# Examine df_corpus
df_corpus
# Examine df_corpus metadata
___
# Compare the number of documents in the vector source
vec_corpus
# Compare metadata in the vector corpus
___