N-grams

1. Bag of words and N-grams

So far you have looked at individual words on their own, without any context or word order. This approach is called a bag-of-words model, because the words are treated as if they were drawn from a bag, with no concept of order or grammar. While analyzing the occurrences of individual words can be a valuable way to create features from a piece of text, individual words can lose their context and meaning when viewed independently.
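
To make the "no concept of order" point concrete, here is a minimal sketch using only the standard library. The helper name bag_of_words and the two sentences are invented for illustration, and splitting on whitespace is a simplifying assumption rather than how a real vectorizer tokenizes text.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercased, whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())


# Two hypothetical sentences with opposite meanings but identical words.
first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

print(first)
print(first == second)  # the two bags are indistinguishable
```

Because both sentences contain exactly the same words the same number of times, the model cannot tell them apart, which is the limitation the next slide addresses.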

2. Issues with bag of words

Take, for example, the word 'happy' shown here. One would assume it was used in a positive context, but if it was actually used in the phrase 'not happy', this assumption would be incorrect. Similarly, if the phrase was extended to 'never not happy', the connotation changes again. One common method to retain at least some concept of word order in a text is to use sequences of consecutive words instead, such as pairs (bi-grams) or triples (tri-grams). This maintains some ordering information while still allowing for the creation of a reasonably sized set of features.
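
As a rough illustration of what bi-grams and tri-grams look like, the sketch below slides a window of n words across a sentence. The function word_ngrams and the example sentence are hypothetical, and whitespace splitting stands in for proper tokenization.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive lowercased words in text."""
    tokens = text.lower().split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I am never not happy"

print(word_ngrams(sentence, 1))  # single words
print(word_ngrams(sentence, 2))  # bi-grams, including 'not happy'
print(word_ngrams(sentence, 3))  # tri-grams, including 'never not happy'
```

The bi-gram 'not happy' and the tri-gram 'never not happy' each become a single feature, so the negation stays attached to the word it modifies.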

3. Using N-grams

To leverage n-grams in your own models, an additional argument, "ngram_range", can be specified when instantiating your TF-IDF vectorizer. The values assigned to the argument are the minimum and maximum length of n-grams to be included. In this case, you would only be looking at bi-grams (n-grams with two words). Printing the bi-gram features created, we can see pairs of words instead of single words.

4. Finding common words

As mentioned in the last video, when creating new features you should always take time to check your work and ensure that the features you are creating make sense. A good way to check your n-grams is to see which values are recorded most often. This can be done by applying the sum() method to the DataFrame of count values that you created.

5. Finding common words

After sorting the values in descending order, you can see the most commonly occurring values. It comes as no surprise that the most commonly occurring bi-gram in a dataset of US presidents' speeches is "United States", which indicates that the features being created make sense.

6. Let's practice!

You should now be able to try out many different combinations of text-based features. It can be interesting to go further and explore the most common longer n-grams, such as three-word sequences called tri-grams.


Feature Engineering for Machine Learning in Python
