TFIDF Practice

Earlier you looked at a bag-of-words representation of articles on crude oil. TFIDF builds on this bag-of-words representation: it weights each word by how often the word appears in an article and by how many articles in the collection contain that word.

To determine how meaningful words would be when comparing different articles, calculate the TFIDF weights for the words in crude, a collection of 20 articles about crude oil.
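To make the weighting concrete, here is a minimal base R sketch that computes TFIDF by hand on a made-up table of word counts (the `counts` data frame and its values are illustrative assumptions, not taken from crude). It assumes the common definition: term frequency is a word's share of its document's words, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy counts: one row per (document, word) pair. Made-up data, not from crude.
counts <- data.frame(
  doc  = c("a", "a", "b", "b"),
  word = c("oil", "price", "oil", "opec"),
  n    = c(2, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Term frequency: the share of a document's words taken up by each word
doc_totals <- tapply(counts$n, counts$doc, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$doc])

# Inverse document frequency: log(documents / documents containing the word).
# Each (doc, word) pair appears once, so table() counts documents per word.
n_docs <- length(unique(counts$doc))
docs_with_word <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))

# TFIDF is the product of the two
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy table "oil" appears in both documents, so its idf and its tf_idf are zero, which is why the exercise asks for the lowest non-zero values rather than simply the lowest ones.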

This exercise is part of the course

Introduction to Natural Language Processing in R

Exercise instructions

Calculate TFIDF values for crude by article_id and by word. Save the resulting tibble as crude_weights.
Sort crude_weights with the arrange() function by descending tf_idf values.
Filter crude_weights to rows with non-zero tf_idf values, then sort them in ascending order to find the lowest ones. Again, use the arrange() function.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a tibble with TFIDF values
___ <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word) %>%
  ___(___, ___, n)

# Find the highest TFIDF values
crude_weights %>%
  ___(desc(___))

# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(___ != ___) %>%
  ___(___)
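For comparison, the same pattern can be sketched on a small made-up table of counts. This is only an illustration under stated assumptions: `toy_counts`, `toy_weights`, and the `doc_id` column are hypothetical names, and the data is invented. It uses `bind_tf_idf()` from tidytext, which takes the term column, the document column, and the count column, in that order.

```r
library(dplyr)
library(tidytext)

# Made-up word counts for three tiny documents (not the crude data)
toy_counts <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3),
  word   = c("oil", "barrel", "oil", "opec", "oil", "barrel"),
  n      = c(3, 1, 2, 2, 1, 4)
)

# Add tf, idf, and tf_idf columns
toy_weights <- toy_counts %>%
  bind_tf_idf(word, doc_id, n)

# Largest weights first
toy_weights %>%
  arrange(desc(tf_idf))

# Smallest weights that are not zero
toy_weights %>%
  filter(tf_idf > 0) %>%
  arrange(tf_idf)
```

Here "oil" occurs in every document, so all of its tf_idf values are zero and the filter drops them before sorting.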
