1. Building tf-idf document vectors
In the last chapter,
we learned about n-gram modeling.
2. n-gram modeling
In n-gram modeling, the weight of a dimension for the vector representation of a document is dependent
on the number of times the word corresponding to the dimension occurs in the document. Let's say we have a document that has the word 'human'
occurring 5 times. Then, the dimension of its vector representation corresponding
to 'human' would have the value 5.
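For reference, a minimal sketch of this count-based representation using scikit-learn's CountVectorizer, which we've used so far (the one-document corpus below is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, assumed for illustration: 'human' occurs 5 times in the single document
corpus = ["a human helping a human is a human teaching a human to be human"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # the dimension for 'human' holds the value 5
```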
3. Motivation
However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations become dominated by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them. We could argue that although both 'jupiter' and 'universe' occur 20 times, 'jupiter' should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe' does.
4. Applications
Weighting words this way has a huge number of applications. These weights can be used to automatically detect stopwords for a corpus instead of relying on a generic list. They're used in search algorithms to determine the ranking of pages containing the search query, and in recommender systems, as we will soon find out. In many cases, this kind of weighting also leads to better performance during predictive modeling.
5. Term frequency-inverse document frequency
The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional
to its frequency and an inverse function
of the number of documents in which it occurs.
6. Mathematical formula
Mathematically, the weight of a term i in document j is computed as
7. Mathematical formula
term frequency of the term i in document j
8. Mathematical formula
multiplied by the log of the ratio of the number of documents in the corpus to the number of documents in which the term i occurs, known as the document frequency or df_i.
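Written compactly, with N standing for the total number of documents in the corpus (a symbol the narration does not name explicitly), this is:

$$ w_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right) $$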
9. Mathematical formula
Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus, and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log of 20 over 8, which is approximately 2. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, or that the word occurs extremely commonly in the document, or both.
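As a quick check on this arithmetic (note that the narration uses a base-10 log here; scikit-learn's TfidfVectorizer uses the natural log with a smoothed formula by default, so its values will differ):

```python
import math

tf = 5        # 'library' occurs 5 times in the document
n_docs = 20   # documents in the corpus
df = 8        # documents containing 'library'

weight = tf * math.log10(n_docs / df)
print(round(weight, 2))  # 1.99, approximately 2
```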
10. tf-idf using scikit-learn
Generating vectors that use tf-idf weighting is almost identical to what we've already done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The only difference is that TfidfVectorizer assigns weights using the tf-idf formula from before and has extra parameters related to inverse document frequency, which we will not cover in this course. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect the values calculated by the tf-idf formula.
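A minimal sketch of what that usage might look like; the three-document corpus below is assumed for illustration and is not the corpus shown on the slide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed for illustration
corpus = [
    "jupiter is the largest planet",
    "the universe contains billions of galaxies",
    "jupiter has dozens of moons",
]

# Same workflow as CountVectorizer: instantiate, then fit and transform
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (dimensions)
print(tfidf_matrix.toarray())              # non-integer tf-idf weights
```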
11. Let's practice!
That's enough theory for now. Let's practice!