Bag-of-words for book titles
PyBooks now has a list of book titles that need to be encoded for further analysis. The data team believes the Bag of Words (BoW) model could be the best approach.
The following packages have been imported for you: torch, torchtext.
This exercise is part of the course Deep Learning for Text with PyTorch.
Exercise instructions
- Import the CountVectorizer class for implementing bag-of-words.
- Initialize an object of the class you imported, then use this object to transform the titles into a matrix representation.
- Extract and display the first five feature names with the get_feature_names_out() method, along with the first five encoded values of the first title.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import from sklearn
from sklearn.feature_extraction.text import ____
titles = ['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'The Catcher in the Rye', 'The Hobbit', 'Great Expectations']
# Initialize Bag-of-words with the list of book titles
vectorizer = ____()
bow_encoded_titles = ____.fit_transform(____)
# Extract and print the first five features
print(vectorizer.____[:5])
print(bow_encoded_titles.toarray()[0, :5])
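
To get a feel for what the encoded matrix should contain, the sketch below builds the same kind of count matrix by hand using only the standard library. It is not the exercise solution and does not use scikit-learn. It assumes CountVectorizer's documented defaults (lowercasing, tokens of two or more word characters, an alphabetically sorted vocabulary). The names TOKEN_PATTERN, tokenize, and bag_of_words are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

titles = ['The Great Gatsby', 'To Kill a Mockingbird', '1984',
          'The Catcher in the Rye', 'The Hobbit', 'Great Expectations']

# Assumption: mirrors CountVectorizer's default token pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one row of term counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct token, in alphabetical order.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(titles)
print(vocabulary[:5])  # ['1984', 'catcher', 'expectations', 'gatsby', 'great']
print(matrix[0][:5])   # [0, 0, 0, 1, 1]
```

Under these assumptions the single-letter word "a" never enters the vocabulary, and "the" is counted twice in the row for 'The Catcher in the Rye'. Word order within each title is discarded; only the counts remain.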