Using longer n-grams
So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that looking at words individually ignores a lot of the context. To address this when creating models, you can use n-grams, which are sequences of n consecutive words grouped together. For example:
- bigrams: Sequences of two consecutive words
- trigrams: Sequences of three consecutive words
These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2), where all n-grams in the n1 to n2 range are included.
This exercise is part of the course Feature Engineering for Machine Learning in Python.
Exercise instructions
- Import CountVectorizer from sklearn.feature_extraction.text.
- Instantiate CountVectorizer while considering only trigrams.
- Fit the vectorizer and apply it to the text_clean column in one step.
- Print the feature names generated by the vectorizer.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import CountVectorizer
from sklearn.feature_extraction.text import ____
# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100,
stop_words='english',
____)
# Fit and apply trigram vectorizer
cv_trigram = ____(speech_df['text_clean'])
# Print the trigram features
print(cv_trigram_vec.____)