DTM vs. tidytext matrix
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. This chapter covers some tidy functions to manipulate data. In this exercise you will compare a DTM to a tidy text data frame called a tibble.
Within the tidyverse, each observation is a single row in a data frame. That makes working across packages much easier, since the fundamental data structure is the same. Parts of this course borrow heavily from the tidytext package, which uses this data organization.
For example, you may already be familiar with the %>% operator from the magrittr package. It forwards the object on its left-hand side as the first argument of the function on its right-hand side.
In the example below, you are forwarding the data object to function1(). Notice how the parentheses are empty. The output of function1() is in turn forwarded to function2(). In that last function you don't have to add the data object, because it is forwarded from the output of function1(). However, you do add a fictitious parameter, some_parameter, set to TRUE. These pipe forwards ultimately create object.
object <- data %>%
  function1() %>%
  function2(some_parameter = TRUE)
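To make the forwarding concrete, here is a minimal sketch that swaps the placeholder functions for two base R functions, sqrt() and round(). The data vector is a made-up example, not part of the exercise.

```r
library(magrittr)  # provides the %>% operator

# Hypothetical input: a small numeric vector
data <- c(4, 9, 16)

# Piped form: data becomes the first argument of sqrt(), and the result
# of sqrt() becomes the first argument of round()
object <- data %>%
  sqrt() %>%
  round(digits = 1)

# Equivalent nested form, for comparison
nested <- round(sqrt(data), digits = 1)

# Both forms produce the same value
identical(object, nested)
```

The piped form reads top to bottom in the order the steps run, whereas the nested form has to be read from the inside out.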
To use the %>% operator, you don't necessarily need to load the magrittr package, since it is also available in the dplyr package.
dplyr also contains the functions inner_join() (which you'll learn more about later) and count() for tallying data. The last function you'll need is mutate(), which creates new variables or modifies existing ones.
object <- data %>%
  mutate(new_Var_name = Var1 - Var2)
Or, to modify an existing variable:
object <- data %>%
  mutate(Var1 = as.factor(Var1))
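The two mutate() patterns, together with count(), can be tried on a small made-up data frame. This is only a sketch: the column names Var1, Var2, and group and their values are hypothetical.

```r
library(dplyr)

# Hypothetical data frame: two numeric columns and one character column
data <- tibble(
  Var1  = c(10, 20, 30),
  Var2  = c(1, 2, 3),
  group = c("a", "b", "a")
)

object <- data %>%
  mutate(
    new_Var_name = Var1 - Var2,  # create a new column
    group = as.factor(group)     # overwrite an existing column with a new type
  )

# count() tallies the number of rows for each distinct value of group
tally <- object %>%
  count(group)

object
tally
```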
You will also use tidyr's pivot_wider() function to organize the data so that each row is a line from the book and the positive and negative values are columns.
| index | negative | positive |
|---|---|---|
| 42 | 2 | 0 |
| 43 | 0 | 1 |
| 44 | 1 | 0 |
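A table of this shape can be produced by chaining inner_join(), count(), and pivot_wider(). The sketch below uses invented tokens and a four-word stand-in lexicon chosen to reproduce the counts above; neither comes from the actual Agamemnon text or from a real sentiment lexicon.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens: one row per word, tagged with its line number
words <- tibble(
  index = c(42, 42, 43, 44, 44),
  word  = c("murder", "grief", "joy", "doom", "house")
)

# Hypothetical mini lexicon (a stand-in for a real sentiment lexicon)
lexicon <- tibble(
  word      = c("murder", "grief", "doom", "joy"),
  sentiment = c("negative", "negative", "negative", "positive")
)

wide <- words %>%
  inner_join(lexicon, by = "word") %>%  # keep only words found in the lexicon
  count(index, sentiment) %>%           # tally words per line and sentiment
  pivot_wider(
    names_from  = sentiment,            # one column per sentiment value
    values_from = n,
    values_fill = 0                     # no match in a category becomes 0
  )

wide
```

Setting values_fill = 0 matters here: without it, a line with no negative (or no positive) words would show NA instead of 0.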
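To change a DTM to a tidy format, use tidy(). The tidy() generic was popularized by the broom package; the method that handles a DTM is supplied by the tidytext package.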
tidy_format <- tidy(Document_Term_Matrix)
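The difference between the two formats is easiest to see on a tiny example. The following sketch builds a toy DTM with the tm package from two invented sentences (not the exercise's ag_dtm) and converts it both ways.

```r
library(tm)
library(tidytext)

# Hypothetical two-document corpus
docs <- c("the king returns home", "the queen waits at home")
corpus <- VCorpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(corpus)

# Matrix form: one row per document, one column per term,
# with a 0 wherever a term does not occur in a document
toy_m <- as.matrix(toy_dtm)
toy_m

# Tidy form: one row per document-term pair that actually occurs,
# with columns document, term, and count
toy_tidy <- tidy(toy_dtm)
toy_tidy
```

In the matrix, a real corpus is mostly zeros; the tidy version drops those zero cells and keeps only the observed pairs.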
This exercise uses text from the Greek tragedy Agamemnon, a story about marital infidelity and murder.
This exercise is part of the course Sentiment Analysis in R.
Exercise instructions
We've already created a clean DTM called ag_dtm for this exercise.
- Create ag_dtm_m by applying as.matrix() to ag_dtm.
- Using brackets, [ and ], index ag_dtm_m to row 2206.
- Apply tidy() to ag_dtm. Call the new object ag_tidy.
- Examine ag_tidy at rows [831:835, ] to compare the tidy format. You will see a common word from the part of ag_dtm_m examined in step 2.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# As matrix
ag_dtm_m <- ___
# Examine line 2206 and columns 245:250
ag_dtm_m[___, 245:250]
# Tidy up the DTM
ag_tidy <- ___
# Examine tidy with a word you saw
ag_tidy[___, ]