
Using longer n-grams

So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that, by looking at words individually, a lot of the context is being ignored. To deal with this when creating models, you can use n-grams, which are sequences of n words grouped together. For example:

  • bigrams: Sequences of two consecutive words
  • trigrams: Sequences of three consecutive words

These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2) where all n-grams in the n1 to n2 range are included.

This exercise is part of the course

Feature Engineering for Machine Learning in Python

Exercise instructions

  • Import CountVectorizer from sklearn.feature_extraction.text.
  • Instantiate CountVectorizer while considering only trigrams.
  • Fit the vectorizer and apply it to the text_clean column in one step.
  • Print the feature names generated by the vectorizer.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Import CountVectorizer
from sklearn.feature_extraction.text import ____

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ____)

# Fit and apply trigram vectorizer
cv_trigram = ____(speech_df['text_clean'])

# Print the trigram features
print(cv_trigram_vec.____)