1. Understanding an R corpus
Hello again, and welcome to chapter 2. We will soon build on the regular expression and tokenization lessons from chapter 1, but first, let's explore collections of texts, which we will call a corpus.
2. Corpora
From the R documentation,
corpora are collections of documents containing natural language text.
The most common representation of this is from the tm package, and is called a Corpus. "Corpora" is the plural form of "corpus."
More specifically though, a VCorpus, or volatile corpus, is usually used to hold both the text and the metadata about the collection of text we are using.
3. Contents of a VCorpus: metadata
The tm package contains an example dataset for us to use, called acq,
which contains 50 articles from the Reuters dataset. The data is not that important here, but understanding the VCorpus structure is.
For the first article, we access the metadata by calling the meta item.
There is data about the author, a time stamp, heading, and much more. All of this data might be useful to our analysis going forward and it is important that we know how to access it.
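As a minimal sketch of that lookup, assuming the tm package is installed and the corpus is loaded with data("acq"), the metadata of the first article can be pulled out like this. The variable name first_meta is only for illustration.

```r
library(tm)

# Load the example corpus of 50 Reuters articles that ships with tm
data("acq")

# Each document is a list with a meta item and a content item;
# [[1]] selects the first article
first_meta <- acq[[1]]$meta

# Print the metadata fields stored for this article
first_meta
```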
4. Contents of a VCorpus: metadata
Furthermore, the individual objects within meta can also be accessed. If you have not worked with nested objects before, pay attention to this structure.
Here we accessed the meta item of the first article and then the character value for the place where the article originated,
which has a value of USA.
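The nested access just described could look roughly like the following. This assumes the field that stores the article's place of origin is named places, which is how tm labels it for this dataset; first_place is an illustrative name.

```r
library(tm)

data("acq")

# Drill down one level at a time:
# corpus -> first document -> meta item -> places field
first_place <- acq[[1]]$meta$places

first_place
```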
5. Contents of a VCorpus: content
We can also see that the VCorpus contains the actual text of each document
by accessing the content item.
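Here is a short sketch of reading the text of the first article, again assuming the acq corpus from tm. The name first_text is only for illustration.

```r
library(tm)

data("acq")

# The content item holds the raw text of the document
first_text <- acq[[1]]$content

# Show the beginning of the article rather than the whole text
substr(first_text[1], 1, 200)
```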
Keeping the contents of a VCorpus as they are allows us to use several analysis functions from the tm package. However, for other kinds of analysis, we need a tidier version of the data.
6. Tidying a corpus
In order to get the data into a table format, where each observation is represented by a row, and each variable is a column,
we can use the tidy() function on a VCorpus. This will grab the values of each article, for both the metadata and the content, and convert them into a tibble.
We have already been working with tibbles when we tokenized and removed stopwords in the previous chapter.
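A minimal sketch of the tidying step follows. It assumes the tidytext package is installed, since that is the package that supplies a tidy() method for corpus objects; the transcript itself does not name it. The name tidy_data is illustrative.

```r
library(tm)
library(tidytext)

data("acq")

# Convert the corpus to a tibble: one row per article,
# metadata fields and the text each become a column
tidy_data <- tidy(acq)

tidy_data
```

Each row of the result is one article, so the tibble can go straight into the tokenization workflow from the previous chapter.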
7. Creating a corpus
Of course there is the flip side to this operation. Perhaps you are interested in using some analysis functions from the tm package, but you only have a tibble.
We can convert the text from a tibble into the format of a VCorpus. We use the VCorpus() function, along with the VectorSource() function, to create a corpus. This basically tells R to create a corpus by using a vector of text.
However, this only captures the text. To add the meta information from the tidy data, we use the meta() function,
which adds a column to the metadata data frame attached to the corpus.
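The round trip can be sketched as below. To stay self-contained, the example first builds a tibble from acq with tidytext's tidy(), then converts it back. The metadata tag "Author" and the variable names are illustrative choices, and the author column is assumed to be present in the tidied data.

```r
library(tm)
library(tidytext)

data("acq")

# Start from a tibble that has a text column
tidy_data <- tidy(acq)

# Build a corpus from the vector of text: one document per element
corpus <- VCorpus(VectorSource(tidy_data$text))

# Attach a metadata column to the corpus, one value per document
meta(corpus, "Author") <- tidy_data$author

# Inspect the metadata data frame attached to the corpus
head(meta(corpus))
```

Other columns of the tibble can be attached the same way, one meta() assignment per column.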
8. Let's see this in action.
Let's practice going back and forth between corpus objects and tibbles.