
Getting granular with n-grams

1. Getting granular with n-grams

You might remember from an earlier video that, with a bag-of-words approach, word order is discarded.

2. Context matters

Imagine you have a sentence such as 'I am happy, not sad' and another one, 'I am sad, not happy'. They will have the same BOW representation, even though their meanings are inverted. In this case, putting NOT in front of a word (which is also called negation) changes the whole meaning and demonstrates why context is important.

3. Capturing context with a BOW

There is a way to capture context when using a BOW: for example, by considering pairs or triples of tokens that appear next to each other. Let's define some terms. Single tokens are what we have used so far and are also called 'unigrams'. Bigrams are pairs of tokens, trigrams are triples of tokens, and in general a sequence of n tokens is called an 'n-gram'.

4. Capturing context with BOW

Let's illustrate that with an example. Take the sentence 'The weather today is wonderful' and split it into unigrams, bigrams and trigrams. With unigrams we get single tokens; with bigrams, pairs of neighboring tokens; and with trigrams, triples of neighboring tokens.
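As a rough illustration, the three splits could be produced in plain Python like this (this is just a sketch of the idea, not the tool we use in the next slide):

sentence = 'The weather today is wonderful'
tokens = sentence.lower().split()

unigrams = tokens
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
trigrams = [' '.join(triple) for triple in zip(tokens, tokens[1:], tokens[2:])]

print(unigrams)   # ['the', 'weather', 'today', 'is', 'wonderful']
print(bigrams)    # ['the weather', 'weather today', 'today is', 'is wonderful']
print(trigrams)   # ['the weather today', 'weather today is', 'today is wonderful']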

5. n-grams with the CountVectorizer

It is easy to use n-grams with the CountVectorizer. To specify the n-grams, we use the ngram_range parameter. The ngram_range is a tuple whose first element is the minimum and whose second element is the maximum n-gram length. For instance, ngram_range=(1, 1) means we will use only unigrams, (1, 2) means we will use unigrams and bigrams, and so on.
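A minimal sketch with scikit-learn's CountVectorizer, using the slide's sentence as a one-document corpus (get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The weather today is wonderful']

# Build unigram and bigram features
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(corpus)
print(vect.get_feature_names_out())
# ['is' 'is wonderful' 'the' 'the weather' 'today' 'today is'
#  'weather' 'weather today' 'wonderful']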

6. What is the best n?

It's not easy to determine the optimal sequence length for your problem. If we use longer token sequences, this will result in more features. In principle, the number of bigrams could be as large as the number of unigrams squared, the number of trigrams as large as the number of unigrams to the power of 3, and so forth. In general, longer sequences result in more precise machine learning models, but they also increase the risk of overfitting. One approach to finding the optimal sequence length is to try different lengths in something like a grid search and see which results in the best model.
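Here is a minimal sketch of such a grid search, assuming a small labelled sentiment corpus and a scikit-learn Pipeline with a logistic regression classifier; the texts, labels and parameter values below are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labelled data: 1 = positive sentiment, 0 = negative
texts = ['I am happy, not sad', 'I am sad, not happy',
         'wonderful weather today', 'terrible weather today']
labels = [1, 0, 1, 0]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('clf', LogisticRegression())])

# Try unigrams only, unigrams + bigrams, and unigrams + bigrams + trigrams
param_grid = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)]}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)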

7. Specifying vocabulary size

The length of the token sequence is not the only way to control the size of the vocabulary. There are a few parameters in the CountVectorizer that can do that as well. You might remember we set the max_features parameter: it tells the CountVectorizer to keep only that many of the most frequent tokens in the corpus. If it is set to None, all the words in the corpus are included. So this parameter can remove rare words, which, depending on the context, may or may not be a good idea. Another parameter you can specify is max_df. If given, it tells the CountVectorizer to ignore terms with a document frequency higher than the given threshold. We can specify it as an integer, which is an absolute document count, or as a float, which is a proportion of documents. The default value of max_df is 1.0, which means it does not ignore any terms. Very similar to max_df is min_df, which is used to remove terms that appear too infrequently. It can again be specified either as an integer, in which case it is a count, or as a float, in which case it is a proportion. Its default value is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
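A minimal sketch of these parameters on a made-up three-document corpus (the specific values are arbitrary examples, not recommendations):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The weather today is wonderful',
          'The weather today is terrible',
          'I am happy, not sad']

# Keep only the 5 most frequent tokens in the corpus (max_features=None keeps them all)
vect_top = CountVectorizer(max_features=5)

# Ignore terms in more than 70% of documents (float = proportion)
# and terms in fewer than 2 documents (integer = count)
vect_df = CountVectorizer(max_df=0.7, min_df=2)
print(vect_df.fit(corpus).get_feature_names_out())
# ['is' 'the' 'today' 'weather']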

8. Let's practice!

Let's go to the exercises where you will specify the token sequence length and the size of the vocabulary.
