TFIDF Practice
Earlier you looked at a bag-of-words representation of articles on crude oil. Calculating TFIDF values relies on this bag-of-words representation, but takes into account how often a word appears in an article, and how often that word appears in the collection of articles.
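For reference, one common way to write the weighting is sketched below. The symbols are assumptions introduced here, not notation from the course: n_{t,d} is the count of word t in article d, N is the number of articles in the collection, and N_t is the number of articles that contain t. Other variants exist (different log bases, smoothing), so treat this as an illustration rather than the exact definition used by any particular function.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{N_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this definition, a word that appears in every article has N_t = N, so its idf and therefore its weight are zero.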
To determine how meaningful words would be when comparing different articles, calculate the TFIDF weights for the words in crude, a collection of 20 articles about crude oil.
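To make the weighting concrete before working with the real data, here is a minimal base R sketch that computes the weights by hand on a made-up table of word counts. The toy_counts data and its three tiny "articles" are hypothetical, and the sketch assumes a natural-log inverse document frequency; it is not the course's solution code.

```r
# Hypothetical toy data: one row per (article, word) pair with its count
toy_counts <- data.frame(
  article_id = c(1, 1, 2, 2, 3, 3),
  word = c("oil", "price", "oil", "opec", "oil", "tanker"),
  n = c(2, 1, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Term frequency: each word's share of the counted words in its article
toy_counts$tf <- toy_counts$n /
  ave(toy_counts$n, toy_counts$article_id, FUN = sum)

# Inverse document frequency (assumed natural log):
# ln(number of articles / number of articles containing the word)
n_articles <- length(unique(toy_counts$article_id))
articles_with_word <- table(toy_counts$word)
toy_counts$idf <- log(
  n_articles / as.numeric(articles_with_word[toy_counts$word])
)

# The weight is the product of the two
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf

# Highest weights first
toy_counts[order(-toy_counts$tf_idf), ]
```

In this toy table "oil" occurs in all three articles, so its weight is zero everywhere, while "opec", which dominates a single article, gets the largest weight. That is why it can be useful to look at the lowest non-zero weights separately from the zeros.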
This exercise is part of the course
Introduction to Natural Language Processing in R
Exercise instructions
- Calculate TFIDF values for crude by article_id and by word. Save the resulting tibble as crude_weights.
- Sort crude_weights with the arrange() function by descending tf_idf values.
- Filter crude_weights to the lowest non-zero tf_idf values. Again, use the arrange() function.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Create a tibble with TFIDF values
___ <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word) %>%
  ___(___, ___, n)

# Find the highest TFIDF values
crude_weights %>%
  ___(desc(___))

# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(___ != ___) %>%
  ___(___)