
Using longer n-grams

So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that looking at words individually ignores a lot of the context. To deal with this when creating models, you can use n-grams, which are sequences of n consecutive words grouped together. For example:

  • bigrams: Sequences of two consecutive words
  • trigrams: Sequences of three consecutive words

These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2) where all n-grams in the n1 to n2 range are included.

This exercise is part of the course Feature Engineering for Machine Learning in Python.

Exercise instructions

  • Import CountVectorizer from sklearn.feature_extraction.text.
  • Instantiate CountVectorizer while considering only trigrams.
  • Fit the vectorizer and apply it to the text_clean column in one step.
  • Print the feature names generated by the vectorizer.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import CountVectorizer
from sklearn.feature_extraction.text import ____

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ____)

# Fit and apply trigram vectorizer
cv_trigram = ____(speech_df['text_clean'])

# Print the trigram features
print(cv_trigram_vec.____)