Create a Tidy Text Tibble!
Now that you have learned about tidy principles, this exercise helps you organize your data into a tibble so you can work within the tidyverse!
Previously you learned that applying tidy() on a TermDocumentMatrix() object will convert the TDM to a tibble. In this exercise you will create the word data directly from the review column called comments.
First you use unnest_tokens() to make the text lowercase and tokenize the reviews into single words.
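A minimal sketch of this tokenizing step, assuming the dplyr and tidytext packages are installed. The toy_reviews tibble is a hypothetical stand-in for the course's bos_reviews data, which is not reproduced here.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per review, text in the 'comments' column
toy_reviews <- tibble(
  id = c(1, 2),
  comments = c("The room was GREAT", "Not a clean place")
)

# unnest_tokens(tbl, output, input): 'word' names the new column,
# 'comments' is the text column to split into single words
toy_tidy <- toy_reviews %>%
  unnest_tokens(word, comments)

toy_tidy
```

By default the result has one row per word, the words are lowercased, and the original comments column is dropped.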
Sometimes it is useful to capture the original word order within each group of a corpus. To do so, use mutate(). In mutate() you will use seq_along() to create a sequence of numbers from 1 to the length of the object. This will capture the word order as it was written.
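The sketch below illustrates the word-order idea on hypothetical toy data (toy_tidy is made up for illustration and already tokenized). It assumes dplyr is installed and groups by an id column so that the numbering restarts for each review.

```r
library(dplyr)

# Hypothetical tokenized reviews: one word per row, two reviews
toy_tidy <- tibble(
  id = c(1, 1, 1, 2, 2),
  word = c("great", "quiet", "room", "noisy", "street")
)

# Within each review (group), number the words in the order they appear
toy_ordered <- toy_tidy %>%
  group_by(id) %>%
  mutate(original_word_order = seq_along(word))

toy_ordered
```

Because the tibble is grouped first, seq_along(word) is evaluated once per group rather than once over the whole column.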
In the tm package, you would use removeWords() to remove stopwords. In the tidyverse you first need to load the stop words lexicon and then apply an anti_join() between the tidy text data frame and the stopwords.
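As a rough illustration of the anti-join approach, the sketch below removes stopwords from a hypothetical toy tibble. It assumes dplyr and tidytext are installed; the explicit by = "word" argument is an addition for clarity, since word is the column the two tables share.

```r
library(dplyr)
library(tidytext)

# Load the stop_words data frame that ships with tidytext
# (it has a 'word' column and a 'lexicon' column)
data("stop_words")

# Hypothetical tokenized review, one word per row
toy_tidy <- tibble(
  id = c(1, 1, 1, 1),
  word = c("the", "bathroom", "was", "spotless")
)

# Keep only the rows whose 'word' has no match in stop_words
toy_clean <- toy_tidy %>%
  anti_join(stop_words, by = "word")

toy_clean
```

An anti-join returns the rows of the left table that have no match in the right table, which is why it works as a stopword filter here.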
This exercise is part of the course
Sentiment Analysis in R
Exercise instructions
- Create tidy_reviews by piping (%>%) the original reviews object bos_reviews to the unnest_tokens() function. Pass in a new column name, word, and declare the comments column. Remember that in the tidyverse you don't need a $ or quotes.
- Create a new variable the tidy way! Rewrite tidy_reviews by piping tidy_reviews to group_by() with the column id. Then %>% it again to mutate(). Within mutate(), create a new variable original_word_order equal to seq_along(word).
- Print out the tibble, tidy_reviews.
- Load the premade "SMART" stopwords to your R session with data("stop_words").
- Create tidy_reviews_without_stopwords by passing tidy_reviews to anti_join() with a %>%. Within anti_join(), pass in the predetermined stop_words lexicon.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Vector to tibble
tidy_reviews <- bos_reviews %>%
  ___(___, ___)
# Group by and mutate
tidy_reviews <- tidy_reviews %>%
  ___(___) %>%
  ___(original_word_order = ___(___))
# Quick review
___
# Load stopwords
___
# Perform anti-join
tidy_reviews_without_stopwords <- tidy_reviews %>%
  ___(___)