1. The TF-IDF
A simple bag-of-words representation is a great start.
In this lesson, however, we will expand on this idea and introduce the infamous TF-IDF.
2. Bag-of-words pitfalls
Consider these three pieces of text.
We learn about John and Joe, that they are best friends, and that they like tacos. We can clean this text by
removing stop words and punctuation, and lowercasing all words. This is a simple example, but if we just compare the words they have in common, text 1 and text 2 would appear a lot more similar than text 1 and text 3.
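The cleaning steps just described can be sketched with tidytext. Note that the exact wording of the three texts isn't shown on the slide, so the strings below are placeholders:

```r
# A minimal cleaning sketch; the three text strings are placeholders,
# not the actual examples from the slide.
library(dplyr)
library(tidytext)

texts <- tibble(
  doc_id = 1:3,
  text = c("John and Joe are best friends",                        # stand-in for text 1
           "John and Joe are such good friends",                   # stand-in for text 2
           "John and Joe like to eat tacos for lunch every day"))  # stand-in for text 3

clean_words <- texts %>%
  unnest_tokens(word, text) %>%       # lowercases and strips punctuation
  anti_join(stop_words, by = "word")  # removes stop words like "and", "are"
```

`unnest_tokens` handles the lowercasing and punctuation removal in one step, so only the stop word filter needs a separate call.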
3. Sharing common words
Let's compare the words from each text.
When comparing t1 and t2, they share almost every single word.
However, when comparing t1 and t3, they don't seem to share that many words. So t1 and t2 must be more similar than t1 and t3!
4. Tacos matter
Let's return to the original text. One word that should matter here, above all others, is tacos.
All three texts have John and Joe, but only two texts share
John, Joe, and tacos. This is the main idea behind what is known as the term frequency-inverse document frequency matrix.
5. TFIDF
The TF-IDF is a way of representing word counts by considering two components. The first is the term frequency,
which is the proportion of words in a text that are that specific term. For example, clean_t1 has 4 words total, and John is one of those 4, so the term frequency is .25.
Next is the inverse document frequency, which considers how frequently a word appears relative to the full collection of texts. For example, John appears in all 3 clean texts, so the IDF for John is 0. Let's look at how the IDF portion can be calculated.
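Written out (the symbols here are our own notation, not from the slides), the term frequency for John in clean_t1 is:

```latex
% tf(t, d): proportion of words in document d that are the term t
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d},
\qquad
\mathrm{tf}(\text{john}, \text{clean\_t1}) = \frac{1}{4} = 0.25
```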
6. IDF Equation
The inverse document frequency can be calculated in several different ways, but we will use the most common form:
the log of the total number of documents, divided by the number of documents that contain the word.
We can quickly calculate the IDF for some of the words we saw on the previous slide. Tacos appears in 2 out of 3 documents, so it gets an IDF weight of .405.
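In symbols (our notation: N is the total number of documents and n_t the number containing term t; the log here is the natural log, which is what produces the .405 on the slide):

```latex
% idf(t): natural log of total documents over documents containing t
\mathrm{idf}(t) = \ln\!\left(\frac{N}{n_t}\right),
\qquad
\mathrm{idf}(\text{tacos}) = \ln\!\left(\frac{3}{2}\right) \approx 0.405
```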
7. TF + IDF
Putting these two calculations together is straightforward.
The TF-IDF value for tacos can be calculated for each text. Consider clean text 3. Tacos is 1 out of 6 total words, and tacos has an IDF of .405. The TF-IDF is the product of these two values, which is .0675.
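The multiplication for clean text 3, written out:

```latex
% tf-idf is the product of the two weights, here for tacos in clean_t3
\mathrm{tf\_idf}(\text{tacos}, \text{clean\_t3})
= \frac{1}{6} \times \ln\!\left(\frac{3}{2}\right) \approx 0.0675
```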
8. Calculating the TFIDF matrix
In order to quickly implement a TFIDF calculation, we can use tidytext's bind_tf_idf function.
This builds off of our previous steps of tokenization, removing stop words, and counting words. We add the bind_tf_idf function at the end, and tell it to calculate the weights
based on the word,
the document id,
and the word counts, which in this case come from the count function. This function will calculate the term frequency and inverse document frequency values and create a tibble.
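The pipeline described above can be sketched as follows. The input data and the doc_id and text column names are assumptions, since the original tibble isn't shown:

```r
# Sketch of the full TF-IDF pipeline; the texts tibble and its column
# names (doc_id, text) are placeholders for the slide's actual data.
library(dplyr)
library(tidytext)

texts <- tibble(
  doc_id = 1:3,
  text = c("John and Joe are best friends",
           "John and Joe are such good friends",
           "John and Joe like to eat tacos for lunch every day"))

tfidf_tbl <- texts %>%
  unnest_tokens(word, text) %>%           # tokenize
  anti_join(stop_words, by = "word") %>%  # remove stop words
  count(doc_id, word) %>%                 # word counts per document (column n)
  bind_tf_idf(word, doc_id, n)            # adds tf, idf, and tf_idf columns
```

The three arguments to `bind_tf_idf` match the three bullets above: the term column, the document id column, and the count column produced by `count`.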
9. bind_tf_idf output
The bind_tf_idf function quickly adds n, tf, idf, and tf_idf as columns to our output. We can see the tf and idf weights that we have discussed on the previous slides. To get the tf_idf value, you simply multiply the two weights together.
10. TF-IDF Practice
Let's look at a couple of examples to solidify our understanding of the TF-IDF.