
TfIdf: More ways to transform text

1. TfIdf: More ways to transform text

We have worked extensively with a bag of words (BOW) and applied it using a CountVectorizer in Python. As powerful as BOW can be, sometimes we might want to try slightly more sophisticated approaches. In this video we will talk about one of them, an approach called TfIdf, short for term frequency-inverse document frequency.

2. What are the components of TfIdf?

The term frequency tells us how often a given word appears within a document in the corpus. Each word in a document has its own term frequency. The inverse document frequency is commonly defined as the log-ratio between the total number of documents and the number of documents that contain a specific word. As a result, rare words have a high inverse document frequency, while words that appear in many documents have a low one.
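For reference, a common textbook form of the two components and of the combined score looks like this; note that scikit-learn's TfidfVectorizer uses a smoothed and normalized variant of the idf by default:

    tf(t, d)    = (number of times term t appears in document d) / (total number of terms in d)
    idf(t)      = log( N / df(t) ),   where N is the total number of documents and df(t) is the number of documents containing t
    tfidf(t, d) = tf(t, d) * idf(t)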

3. TfIdf score of a word

When we multiply the tf and the idf scores, we obtain the TfIdf score of a word in a corpus. With BOW, words could have different frequency counts across documents, but we did not account for the length of a document; the TfIdf score of a word does incorporate the length of a document. TfIdf also highlights words that are more interesting, i.e. words that are common in a document but not across all documents. Note, however, that interesting does not have to mean related to a positive or a negative review: TfIdf is a purely unsupervised approach.
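To make that concrete, here is a small illustrative calculation with made-up numbers, using the textbook idf above with a natural logarithm. Suppose the corpus contains 100 tweets:

    word appearing in 1 tweet:    idf = log(100 / 1)  ≈ 4.6   (rare, high score)
    word appearing in 90 tweets:  idf = log(100 / 90) ≈ 0.1   (common, low score)

Multiplied by its term frequency within a tweet, the rare word therefore ends up with a much higher TfIdf score than the common one.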

4. How is TfIdf useful?

In our Twitter sentiment analysis, names of airline companies such as United and Virgin America are likely to have low TfIdf scores since they occur many times and across many documents, i.e. tweets. If a tweet talks a lot about the check-in service of a company and there are not many other tweets discussing that topic, words in this tweet are likely to have a high TfIdf score. Note that since TfIdf penalizes frequent words, there is less of a need to explicitly define stop words. We can still remove stop words, of course, to restrict the size of our vocabulary. Even though TfIdf is relatively simple, it is quite commonly used in information retrieval and search engines as a way to rank the relevance of the documents returned for a query.

5. TfIdf in Python

In Python, you can apply TfIdf by importing the TfidfVectorizer from sklearn.feature_extraction.text. The TfidfVectorizer is similar to the CountVectorizer, and so are the arguments it takes. We can define the maximum number of features with max_features, the range of n-grams with ngram_range, and also pass the stop_words, token_pattern, max_df and min_df arguments. We fit the TfidfVectorizer to the text column of the tweets dataset. Then we transform it, the same way we did with the CountVectorizer.
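A minimal sketch of these steps, assuming the data sits in a DataFrame called tweets with a column named text; the specific argument values below are purely illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Build the vectorizer; these argument values are examples, not requirements
    vect = TfidfVectorizer(max_features=100,
                           ngram_range=(1, 2),
                           stop_words='english',
                           token_pattern=r'\b[^\d\W][^\d\W]+\b',
                           max_df=0.9,
                           min_df=2)

    # Fit on the text column of the tweets dataset, then transform it
    vect.fit(tweets.text)
    X_txt = vect.transform(tweets.text)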

6. TfidfVectorizer

The TfidfVectorizer also returns a sparse matrix. If you recall, a sparse matrix is a matrix with mostly zero values that stores only the non-zero values. We need to transform the sparse matrix to an array and specify the feature names, using the same syntax as with the CountVectorizer. Inspecting the top 5 rows of the newly created dataset, we see that the output is quite similar to a BOW. Each column is a feature and each row contains the TfIdf score of that feature in a given tweet. The values are floating-point numbers, and many of them are zero.
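A sketch of turning that sparse output into a labelled DataFrame, reusing the hypothetical vect and X_txt names from the previous snippet; in older scikit-learn versions the vocabulary method is get_feature_names() rather than get_feature_names_out():

    import pandas as pd

    # Convert the sparse matrix to a dense array and label the columns with the vocabulary
    X_df = pd.DataFrame(X_txt.toarray(),
                        columns=vect.get_feature_names_out())

    # Inspect the first five rows: one row per tweet, one TfIdf score per feature
    print(X_df.head())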

7. Let's practice!

Let's wrap up our discussion on numerical transformation of text data by solving some exercises using TfIdf. See you in the next video!