TF-IDF vectorization
1. TF-IDF vectorization
Let's now talk about a powerful enhancement to BoW: TF-IDF, or Term Frequency–Inverse Document Frequency.
2. From BoW to TF-IDF
Although BoW is useful, one of its limitations is that it treats all words as equally important. In a dataset containing two documents, each with one sentence: "I love this NLP course" and "I enjoyed this project", a word that appears in both sentences, such as "I" or "this", receives the same importance as unique and informative terms like "course" or "project". TF-IDF helps us fix that. It tells us not just how often a word appears in a document but also how unique or meaningful that word is across the entire collection of documents.
3. TF-IDF
TF-IDF is the product of two terms: TF and IDF.
4. TF-IDF
TF stands for Term Frequency and tells us how many times a word appears in a document.
5. TF-IDF
IDF stands for Inverse Document Frequency and tells us how rare that word is across all documents. Combining the two (a standard formulation is sketched below), if a word shows up often in one document but not in many others, it gets a high TF-IDF score. But if it appears in every document, its score goes down.
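The slide's formula isn't reproduced in this transcript, but the classic textbook form looks like this (scikit-learn's TfidfVectorizer actually uses a smoothed variant of IDF and L2-normalizes each row by default):

```latex
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t),
\qquad
\text{IDF}(t) = \log \frac{N}{\text{df}(t)}
```

Here N is the total number of documents and df(t) is the number of documents containing the term t.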
6. TF-IDF with code
Suppose we have a few reviews as follows. Before computing the TF-IDF scores, we apply the previously introduced preprocess() function, which uses a list comprehension to lowercase each review and remove punctuation. This gives us a list of cleaned_reviews.
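The reviews themselves aren't shown in this transcript, so here is a minimal sketch with made-up reviews and a preprocess() written to match the description (the exact function from the earlier lesson may differ):

```python
import string

# Hypothetical sample reviews (illustrative only; not the slide's data)
reviews = [
    "The movie was amazing, I loved it!",
    "The plot was fine, the acting was decent.",
    "I hated it. The story was boring.",
]

def preprocess(texts):
    # Lowercase each review and strip punctuation with a list comprehension
    return [
        "".join(ch for ch in text.lower() if ch not in string.punctuation)
        for text in texts
    ]

cleaned_reviews = preprocess(reviews)
print(cleaned_reviews)
```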
7. TF-IDF with code
Next, we import TfidfVectorizer from sklearn.feature_extraction.text and initialize the vectorizer. We apply it to our cleaned data using vectorizer.fit_transform(cleaned_reviews). This gives us a sparse matrix.
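Continuing the sketch (the variable name tfidf_matrix is our choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer with its default settings
vectorizer = TfidfVectorizer()

# Learn the vocabulary and compute the TF-IDF scores in one step
tfidf_matrix = vectorizer.fit_transform(cleaned_reviews)

print(type(tfidf_matrix))  # a SciPy sparse matrix
```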
8. TF-IDF output
To visualize it, we use .toarray() and receive an array of TF-IDF scores, where each row represents a review and each column represents a word from the vocabulary. To get the column names, we can use vectorizer.get_feature_names_out().
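For example, continuing the sketch above:

```python
# Densify the sparse matrix: rows are reviews, columns are vocabulary words
scores = tfidf_matrix.toarray()

# The word corresponding to each column, in order
words = vectorizer.get_feature_names_out()

print(scores.shape)  # (number of reviews, vocabulary size)
print(words)
```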
9. Visualizing scores as heatmap
The TF-IDF matrix is still a bit abstract, so let's visualize it. We convert the matrix into a DataFrame from the array of TF-IDF scores and the column names. Then we plot a heatmap of this DataFrame using sns.heatmap, adding a suitable title and labels for both axes. This heatmap highlights which words are most relevant in each review. Brighter colors indicate higher relevance of a term within that specific review. For example, in the first review, "amazing" and "loved" stand out, while in the third review, "boring" and "hated" are the most prominent.
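A possible implementation, continuing the sketch (the figure size, color map, and annotation settings are our choices, not necessarily the slide's):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per review, one column per vocabulary word
df = pd.DataFrame(scores, columns=words)

plt.figure(figsize=(10, 3))
sns.heatmap(df, annot=True, fmt=".2f", cmap="viridis")
plt.title("TF-IDF scores per review")
plt.xlabel("Word")
plt.ylabel("Review")
plt.show()
```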
10. Comparing with BoW
Comparing with BoW, we see that BoW treats all words equally, even frequent, uninformative ones like "was" and "the". TF-IDF, by contrast, reduces the importance of such common words and instead highlights words like "amazing" or "hated", which are less frequent overall but more meaningful in context.
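To see the contrast concretely, we can build plain BoW counts for the same cleaned reviews with CountVectorizer and compare a common word against a rare one (the column names below assume our made-up reviews from earlier):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Plain bag-of-words counts for the same cleaned reviews
bow = CountVectorizer()
bow_df = pd.DataFrame(
    bow.fit_transform(cleaned_reviews).toarray(),
    columns=bow.get_feature_names_out(),
)

# In BoW, "the" is among the highest counts in every review;
# TF-IDF pushes its weight down relative to a rare word like "amazing"
print(bow_df[["the", "amazing"]])
print(df[["the", "amazing"]])
```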
11. Let's practice!
Time to put this into practice!