Exercise

Using longer n-grams

So far you have created features based on the individual words in each text. This can be quite powerful in a machine learning model, but looking at words one at a time ignores much of the context in which they appear. To recover some of that context, you can use n-grams, which are sequences of n consecutive words grouped together. For example:

  • bigrams: Sequences of two consecutive words
  • trigrams: Sequences of three consecutive words
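
To make the definitions concrete, here is a minimal sketch in plain Python that slides a window of n words over a sentence. The helper name ngrams and the sample sentence are made up for illustration and are not part of the exercise.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, split on whitespace.
tokens = "the quick brown fox".split()

bigrams = ngrams(tokens, 2)   # windows of two words
trigrams = ngrams(tokens, 3)  # windows of three words

print(bigrams)
print(trigrams)
```

A sentence of four words yields three bigrams and two trigrams, because each window must fit entirely inside the sentence.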

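These can be created automatically from your dataset by passing the ngram_range argument to the vectorizer as a tuple (n1, n2); every n-gram whose length falls between n1 and n2, inclusive, is generated. For example, (1, 2) produces both single words and bigrams, while (3, 3) produces trigrams only.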

Instructions
  • Import CountVectorizer from sklearn.feature_extraction.text.
  • Instantiate CountVectorizer while considering only trigrams.
  • Fit the vectorizer and apply it to the text_clean column in one step.
  • Print the feature names generated by the vectorizer.