n-gram models for movie tag lines
In this exercise, we have been provided with a corpus of more than 9000 movie tag lines. Our job is to generate three n-gram models for this data, with n-grams up to n=1, up to n=2 and up to n=3, and to find the number of features in each model.
We will then compare the number of features generated for each model.
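To make the comparison concrete, here is a minimal sketch on a made-up two-line corpus. The name toy_corpus and its contents are assumptions for illustration only and are not the exercise data. It relies on scikit-learn's CountVectorizer with its default tokenizer, and reads the feature count from the second dimension of the resulting document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-line corpus standing in for the real tag line data
toy_corpus = [
    "the good the bad and the ugly",
    "in space no one can hear you scream",
]

feature_counts = {}
for max_n in (1, 2, 3):
    # ngram_range=(1, max_n) keeps every n-gram from length 1 up to max_n
    vectorizer = CountVectorizer(ngram_range=(1, max_n))
    matrix = vectorizer.fit_transform(toy_corpus)
    # Each column of the document-term matrix is one feature
    feature_counts[max_n] = matrix.shape[1]

print(feature_counts)
```

Because each wider range keeps all the shorter n-grams and adds longer ones, the feature count can only grow as the upper bound increases.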
This exercise is part of the course
Feature Engineering for NLP in Python
Exercise instructions
- Generate an n-gram model with n-grams up to n=1. Name it ng1.
- Generate an n-gram model with n-grams up to n=2. Name it ng2.
- Generate an n-gram model with n-grams up to n=3. Name it ng3.
- Print the number of features for each model.
Interactive exercise
Try this exercise by completing the sample code below.
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1,1))
ng1 = vectorizer_ng1.____(corpus)
# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1,2))
ng2 = vectorizer_ng2.____(corpus)
# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(____, ____))
ng3 = vectorizer_ng3.fit_transform(corpus)
# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.____[1], ng2.____[1], ng3.____[1]))
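Beyond the total feature count, it can help to see how many features each n-gram length contributes. The following is a rough sketch, again on a made-up corpus (toy_corpus is a hypothetical stand-in, not the exercise data). It assumes the default tokenizer, which joins the words of an n-gram with single spaces, so counting the words in each vocabulary entry gives the n-gram length.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-line corpus standing in for the real tag line data
toy_corpus = [
    "the good the bad and the ugly",
    "in space no one can hear you scream",
]

vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(toy_corpus)

# vocabulary_ maps each n-gram string to its column index;
# the number of space-separated words gives the n-gram length
per_length = Counter(len(term.split(" ")) for term in vectorizer.vocabulary_)

for n in sorted(per_length):
    print("%i-grams: %i features" % (n, per_length[n]))
```

The per-length counts add up to the total number of columns in the matrix, which is why a model with n-grams up to n=3 has at least as many features as one with n-grams up to n=2.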