1. TF-IDF Representation
While counts of occurrences of words can be a good first step towards encoding your text to build models, they have some limitations. The main issue is that counts will be much higher for very common words, even when those words occur across all texts, so they provide little value as a distinguishing feature.
2. Introducing TF-IDF
Take for example the counts of the word "the" shown here, with plentiful occurrences in every row. To keep these common words from overpowering your model, some form of normalization can be used. One of the most effective approaches is called "Term Frequency Inverse Document Frequency" or TF-IDF.
3. TF-IDF
TF-IDF divides the number of times a word occurs in a document by a measure of the proportion of all documents in which that word occurs.
This has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.
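For reference, the classic formulation looks like this; note that scikit-learn's TfidfVectorizer actually uses a smoothed variant of the idf term and then normalizes each row, but the intuition is the same.

```latex
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),
\qquad
\text{idf}(t) = \log \frac{N}{\text{df}(t)}
```

Here `tf(t, d)` is the number of times term t occurs in document d, N is the total number of documents, and `df(t)` is the number of documents containing t.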
4. Importing the vectorizer
To use a TF-IDF vectorizer, the approach is very similar to how you applied a count vectorizer. First you must import TfidfVectorizer() from sklearn dot feature_extraction dot text, then you assign it to a variable name. Let's use tv in this case.
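A minimal sketch of those two steps; the variable name `tv` matches the narration:

```python
# Import the TF-IDF vectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Assign an instance of the vectorizer to a variable
tv = TfidfVectorizer()
```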
5. Max features and stopwords
Similar to when you were working with the count vectorizer, you can limit the number of features created by specifying arguments when initializing TfidfVectorizer. You can cap the number of features with the max_features argument; setting it to 100 means only the 100 most common words will be used. We will also tell the vectorizer to omit a set of stop_words: a predefined list of the most common English words such as "and" or "the". You can use scikit-learn's built-in list, load your own, or use lists provided by other Python libraries.
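Sketched with the arguments just described, using scikit-learn's built-in English stop word list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 100 most common words and drop common English stop words
tv = TfidfVectorizer(max_features=100, stop_words='english')
```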
6. Fitting your text
Once the vectorizer has been specified, you can fit it and apply it to the text that you want to transform. Note that here we are fitting and transforming the train data, a subset of the original data.
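A sketch of that step, assuming the training subset is a DataFrame called `train_speech_df` with the raw text in a 'text' column (both names are illustrative, not fixed by scikit-learn):

```python
# Fit the vectorizer on the training text and transform it in one step
# (train_speech_df is an assumed DataFrame holding the training subset)
tv_transformed = tv.fit_transform(train_speech_df['text'])
```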
7. Putting it all together
As before, you combine the TF-IDF values with the feature names in a DataFrame, as shown here.
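Continuing the sketch above; the `TV_` column prefix is just an illustrative naming convention, and on scikit-learn versions before 1.0 the method is `get_feature_names` rather than `get_feature_names_out`:

```python
import pandas as pd

# Wrap the sparse TF-IDF matrix in a DataFrame, one column per feature
tv_df = pd.DataFrame(tv_transformed.toarray(),
                     columns=tv.get_feature_names_out()).add_prefix('TV_')
```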
8. Inspecting your transforms
After transforming your data you should always check how the different words are being valued, and see which words receive the highest scores. This will help you understand whether the features being generated make sense. One ad hoc method is to isolate a single row of the transformed DataFrame (`tv_df` in this case) using the iloc accessor, and then sort the values in that row in descending order as shown here. The top-ranked values make sense for the text of a presidential speech.
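One way to perform that inspection, reusing `tv_df` from above (row 0 chosen arbitrarily):

```python
# Isolate one document's row and rank its TF-IDF scores from highest to lowest
sample_row = tv_df.iloc[0]
print(sample_row.sort_values(ascending=False).head())
```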
9. Applying the vectorizer to new data
So how do you apply this transformation on the test set? As mentioned before, you should preprocess your test data using the transformations made on the train data only.
To ensure that the same features are created you should use the same vectorizer that you fit on the training data. So first transform the test data using the tv vectorizer and then recreate the test dataset by combining the TF-IDF values, feature names, and other columns.
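A sketch of that workflow, assuming a `test_speech_df` DataFrame shaped like the training one (an illustrative name); note the use of transform rather than fit_transform, so no information leaks from the test set:

```python
# Transform the test text with the vectorizer already fit on the training data
test_tv = tv.transform(test_speech_df['text'])

# Rebuild a DataFrame with the same feature columns as the training set
test_tv_df = pd.DataFrame(test_tv.toarray(),
                          columns=tv.get_feature_names_out()).add_prefix('TV_')

# Reattach the other columns from the original test data
test_df = pd.concat([test_speech_df.reset_index(drop=True), test_tv_df], axis=1)
```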
10. Let's practice!
So, now you also know about TF-IDF! Great, it's time for you to implement this.