Gensim bag-of-words
Now you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are, and feel free to explore more documents in the IPython Shell!
You have access to the dictionary and corpus objects you created in the previous exercise, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.
defaultdict allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we ensure that any non-existent keys are automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.
itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
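As a rough illustration of how these two tools fit together, here is a minimal sketch on made-up data. The names toy_docs and token_counts are hypothetical and not part of the exercise; the nested list of tokens simply stands in for any list of lists.

```python
from collections import defaultdict
import itertools

# Hypothetical toy data: a list of lists, one inner list of tokens per document.
toy_docs = [
    ["cat", "dog"],
    ["dog", "bird", "dog"],
]

# defaultdict(int) returns 0 for a key it has not seen yet,
# so counts can be incremented without checking for the key first.
token_counts = defaultdict(int)

# chain.from_iterable walks the inner lists one after another,
# as if they were a single flat sequence of tokens.
for token in itertools.chain.from_iterable(toy_docs):
    token_counts[token] += 1

print(dict(token_counts))  # {'cat': 1, 'dog': 3, 'bird': 1}
```

The same loop shape should carry over to any nested structure, as long as each inner item can be used as a dictionary key or unpacked into one.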
The fifth document from corpus is stored in the variable doc, which is then sorted by word count in descending order (as bow_doc).
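To make the sorting step concrete, here is a small sketch on invented data. The names toy_doc and id_to_token are hypothetical; in particular, a plain Python dict stands in for the gensim dictionary, so the lookup syntax here is only an assumption for illustration.

```python
# Hypothetical bag-of-words document: a list of (word_id, count) pairs.
toy_doc = [(0, 1), (1, 5), (2, 3)]

# Hypothetical id-to-token mapping (a plain dict, not a gensim Dictionary).
id_to_token = {0: "the", 1: "language", 2: "python"}

# Sort by the second element of each pair (the count), largest first.
sorted_doc = sorted(toy_doc, key=lambda pair: pair[1], reverse=True)

# Keep the two most frequent entries and translate ids back to tokens.
top_two = [(id_to_token[word_id], count) for word_id, count in sorted_doc[:2]]
print(top_two)  # [('language', 5), ('python', 3)]
```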
This exercise is part of the course Introduction to Natural Language Processing in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Save the fifth document: doc
doc = corpus[4]
# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)
# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.____(____), ____)
# Create the defaultdict: total_word_count
total_word_count = ____
for word_id, word_count in itertools.chain.from_iterable(corpus):
    ____[____] += ____