Aan de slagGa gratis aan de slag

n-gram models for movie tag lines

In this exercise, we have been provided with a corpus of more than 9000 movie tag lines. Our job is to generate n-gram models up to n equal to 1, n equal to 2 and n equal to 3 for this data and discover the number of features for each model.

We will then compare the number of features generated for each model.

Deze oefening maakt deel uit van de cursus

Feature Engineering for NLP in Python

Cursus bekijken

Oefeninstructies

  • Generate an n-gram model with n-grams up to n=1. Name it ng1
  • Generate an n-gram model with n-grams up to n=2. Name it ng2
  • Generate an n-Gram Model with n-grams up to n=3. Name it ng3
  • Print the number of features for each model.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Generate n-grams upto n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1,1))
ng1 = vectorizer_ng1.____(corpus)

# Generate n-grams upto n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1,2))
ng2 = vectorizer_ng2.____(corpus)

# Generate n-grams upto n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(____, ____))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.____[1], ng2.____[1], ng3.____[1]))
Code bewerken en uitvoeren