1. Tf-idf with gensim
In this video, we will learn how to use a tf-idf model with Gensim.
2. What is tf-idf?
**Tf-idf** stands for term frequency - inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in a corpus.
The idea behind tf-idf is that each corpus might contain shared words beyond just stopwords. These common words act like stopwords and should be removed or at least down-weighted in importance. For example, if I am an astronomer, "sky" might be used often but is not important, so I want to down-weight that word.
Tf-idf does precisely that. It takes texts that share a common vocabulary and ensures that the most common words across the entire corpus don't show up as keywords. Tf-idf keeps the document-specific frequent words weighted high and the words common across the entire corpus weighted low.
3. Tf-idf formula
The equation to calculate the weights can be outlined like so:
The weight of token i in document j is calculated by taking the term frequency (or how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the same term.
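That description can be written compactly as a formula (using w for the weight, tf for term frequency, N for the total number of documents, and df for the number of documents containing the term):

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\!\left(\frac{N}{\mathrm{df}_i}\right)
```

Here, tf_{i,j} is how many times token i appears in document j, N is the total number of documents in the corpus, and df_i is the number of documents that contain token i.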
Let's unpack this a bit. First, the weight will be low if the term doesn't appear often in the document, because the tf variable will then be low.
However, the weight will also be low if the logarithm is close to zero, meaning the expression inside it is close to one.
Here we can see that if the total number of documents divided by the number of documents containing the term is close to one, then our logarithm will be close to zero.
So words that occur across many or all documents will have a very low tf-idf weight. On the contrary, if the word only occurs in a few documents, that logarithm will return a higher number.
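As a quick sanity check on that reasoning, here is a minimal sketch in plain Python; the counts (a corpus of 100 documents, a word appearing 10 times) are made up for illustration:

```python
import math

def tfidf_weight(tf, n_docs, docs_with_term):
    # tf-idf weight: term frequency times log of
    # (total documents / documents containing the term)
    return tf * math.log(n_docs / docs_with_term)

# "sky" appears 10 times in this document but in all 100 documents:
# log(100 / 100) = log(1) = 0, so the weight is zero.
print(tfidf_weight(10, 100, 100))  # 0.0

# A word appearing just as often, but in only 5 of 100 documents,
# gets a much higher weight.
print(tfidf_weight(10, 100, 5))
```

Note how the logarithm alone decides between the two cases: the term frequencies are identical, and only the document frequency differs.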
4. Tf-idf with gensim
You can build a tf-idf model using Gensim and the corpus you developed previously. Taking the corpus of movie reviews we used in the last video, we can turn the bag-of-words corpus into a tf-idf model simply by passing it in at initialization.
We can then reference each document by using it like a dictionary key with our new tfidf model.
For the second document in our corpus, we see the token weights along with the token ids. Notice there are some large differences! Token id 10 has a weight of 0.77 whereas tokens 0 and 1 have weights below 0.18. These weights can help you determine good topics and keywords for a corpus with shared vocabulary.
5. Let's practice!
Now you can build a tf-idf model using Gensim to explore topics in the Wikipedia article list.