
Building n-gram models

1. Building n-gram models

We already know how to build bag-of-words representations of our documents and use them to perform various machine learning tasks.

2. BoW shortcomings

Consider the following mini reviews. One is positive, stating that the movie was good and not boring. The other is negative, commenting that the movie was not good and boring. If we were to construct BoW vectors for these reviews, we would get identical vectors, since both reviews contain exactly the same words. And herein lies the biggest shortcoming of the bag-of-words model: the context of the words is lost. In this example, the position of the word 'not' changes the entire sentiment of the review. Therefore, in this lesson, we will study techniques that allow us to model this context.
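To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the exact review wording is an illustrative assumption, not from the course exercises:

from sklearn.feature_extraction.text import CountVectorizer

# Two reviews with opposite sentiment but exactly the same words
reviews = [
    "the movie was good and not boring",   # positive
    "the movie was not good and boring",   # negative
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews).toarray()

print(vectorizer.get_feature_names_out())  # ['and' 'boring' 'good' 'movie' 'not' 'the' 'was']
print(bow[0])  # [1 1 1 1 1 1 1]
print(bow[1])  # [1 1 1 1 1 1 1] -- identical to the first: word order (context) is lost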

3. n-grams

An n-gram is a contiguous sequence of n elements (or words) in a given document. The bag-of-words model that we've explored so far is nothing but an n-gram model where n is equal to one. Let's now explore n-grams when n is greater than one. Consider the sentence 'for you a thousand times over'. If we set n to 2, then the n-grams (called bigrams in this case) would be 'for you', 'you a', 'a thousand', 'thousand times' and 'times over'.

4. n-grams

Similarly, for n equal to 3, the n-grams (or trigrams) will be 'for you a', 'you a thousand', 'a thousand times' and 'thousand times over'. Therefore, we can use these n-grams to capture more context and account for cases like the word 'not' in our reviews.
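As a quick illustration, here is a minimal sketch of extracting these n-grams by hand; the helper function below is purely for demonstration:

# Extract word n-grams from a sentence using a sliding window
def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "for you a thousand times over"
print(ngrams(sentence, 2))
# ['for you', 'you a', 'a thousand', 'thousand times', 'times over']
print(ngrams(sentence, 3))
# ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']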

5. Applications

Apart from capturing more context, n-grams have a host of other useful applications. They are used in sentence completion, spelling correction and machine translation. In all these cases, the model computes the probability of n words occurring contiguously to perform these tasks.
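As a rough illustration of that idea (the tiny corpus and helper function below are assumptions for demonstration, not part of the course), a bigram model can estimate how likely one word is to follow another from simple counts:

from collections import Counter

# Toy corpus; in practice these counts come from a large collection of text
corpus = "the movie was good . the movie was boring . the plot was good".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def next_word_probability(prev_word, word):
    # P(word | prev_word) ~ count(prev_word, word) / count(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(next_word_probability("movie", "was"))  # 1.0 -> 'was' always follows 'movie' here
print(next_word_probability("was", "good"))   # 0.666... -> 'good' follows 'was' 2 times out of 3

A sentence-completion system would then suggest the word with the highest probability given the preceding words.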

6. Building n-gram models using scikit-learn

Building these n-gram models using scikit-learn is extremely simple, now that we know how to use CountVectorizer. CountVectorizer takes in an argument, ngram_range, which is a tuple containing the lower and upper bounds for the range of n-values. For instance, passing (2, 2) as the ngram_range will generate only bigrams. On the other hand, passing in (1, 3) will generate n-grams where n is equal to 1, 2 and 3.
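Here is a minimal sketch of this on a toy document (the document is an assumption; token_pattern is set explicitly so that single-character words like 'a' are kept, since CountVectorizer's default pattern drops them):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["for you a thousand times over"]

# ngram_range=(2, 2): bigrams only
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
bigram_vectorizer.fit(docs)
print(bigram_vectorizer.get_feature_names_out())
# ['a thousand' 'for you' 'thousand times' 'times over' 'you a']

# ngram_range=(1, 3): unigrams, bigrams and trigrams
mixed_vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
mixed_vectorizer.fit(docs)
print(len(mixed_vectorizer.get_feature_names_out()))  # 15 features: 6 unigrams + 5 bigrams + 4 trigrams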

7. Shortcomings

While on the surface it may seem attractive to generate n-grams of higher orders to capture more and more context, doing so comes with caveats. We've already seen that BoW vectors run into thousands of dimensions. Adding higher-order n-grams increases the number of dimensions even further and, when performing machine learning, leads to a problem known as the curse of dimensionality. Additionally, n-grams for n greater than 3 rarely occur in more than one document, so these features become effectively useless. For these reasons, it is often a good idea to restrict yourself to n-grams where n is small.
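To see this effect on a toy corpus (the corpus below is an assumption purely to illustrate the trend), we can count how many features CountVectorizer produces as the n-gram range widens:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was good and not boring",
    "the movie was not good and boring",
    "the plot was predictable but the acting was great",
]

for upper in range(1, 5):
    vectorizer = CountVectorizer(ngram_range=(1, upper))
    n_features = vectorizer.fit_transform(corpus).shape[1]
    print(f"ngram_range=(1, {upper}) -> {n_features} features")

The feature count grows with every additional order of n-grams, while the higher-order n-grams themselves appear in fewer and fewer documents.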

8. Let's practice!

Great! Let's now build these advanced n-gram models and discover more insights in the exercises.
