Make a VCorpus from a data frame
If your text data is in a data frame, you can use DataframeSource() for your analysis. The data frame passed to DataframeSource() must have a specific structure:
- Column one must be called doc_id and contain a unique string for each row.
- Column two must be called text with "UTF-8" encoding (pretty standard).
- Any other columns, 3+, are considered metadata and will be retained as such.
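As a rough illustration of that structure, here is a minimal sketch that assumes the tm package is installed. The reviews data frame, its author column, and all of its values are made up for this example and are not part of the exercise workspace.

```r
library(tm)

# Hypothetical data frame; names and values are illustrative only
reviews <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),       # column one: unique ID per row
  text   = c("Text mining is fun.",            # column two: the document text
             "R has a package called tm.",
             "Metadata travels with the text."),
  author = c("Ann", "Ben", "Cy"),              # column three onward: metadata
  stringsAsFactors = FALSE
)

# Check the layout before handing the data frame to DataframeSource()
stopifnot(identical(names(reviews)[1:2], c("doc_id", "text")))
stopifnot(!anyDuplicated(reviews$doc_id))

# Build the source, then a volatile corpus with one document per row
reviews_source <- DataframeSource(reviews)
reviews_corpus <- VCorpus(reviews_source)
length(reviews_corpus)
```

The two stopifnot() checks are optional; they simply fail early if the first two columns are misnamed or an ID is repeated.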
This exercise introduces meta() to extract the metadata associated with each document. Often your data will have metadata such as authors, dates, topic tags, or places that can inform your analysis. Once your text is a corpus, you can apply meta() to examine the additional document level information.
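To see what meta() returns in each case, the following sketch builds one corpus from a data frame and one from a plain character vector. It assumes the tm package is installed; the notes data frame and its author and year columns are hypothetical stand-ins, not the objects in your workspace.

```r
library(tm)

# Hypothetical example data; names and values are illustrative only
notes <- data.frame(
  doc_id = c("n1", "n2"),
  text   = c("First short note.", "Second short note."),
  author = c("Ann", "Ben"),
  year   = c(2019, 2020),
  stringsAsFactors = FALSE
)

# Corpus built from a data frame: extra columns become document-level metadata
notes_corpus <- VCorpus(DataframeSource(notes))
notes_meta <- meta(notes_corpus)
print(notes_meta)

# Corpus built from a bare character vector: no extra columns to retain
plain_corpus <- VCorpus(VectorSource(notes$text))
plain_meta <- meta(plain_corpus)
print(dim(plain_meta))
```

Both corpora hold the same number of documents, but only the one built with DataframeSource() should carry the author and year columns in its metadata.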
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, which is a volatile corpus made with VectorSource().
- Create df_source using DataframeSource() with the example_text.
- Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
- Print out df_corpus. Notice how many documents it contains and the number of retained document-level metadata points.
- Use meta() on df_corpus to print the document associated metadata.
- Examine the pre-loaded vec_corpus object. Compare the number of documents to df_corpus.
- Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a DataframeSource from the example text
df_source <- ___
# Convert df_source to a volatile corpus
df_corpus <- ___
# Examine df_corpus
df_corpus
# Examine df_corpus metadata
___
# Compare the number of documents in the vector source
vec_corpus
# Compare metadata in the vector corpus
___