
Gensim bag-of-words

Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!
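If you want to experiment outside the exercise shell, here is a minimal, hypothetical sketch of that kind of lookup. The two small tokenized articles are only stand-ins for the corpus and dictionary you built in the previous exercise.

# Minimal sketch of looking up terms with a gensim Dictionary.
# The tokenized `articles` list is a stand-in for the documents
# from the previous exercise.
from gensim.corpora.dictionary import Dictionary

articles = [
    ["computer", "software", "computer", "system"],
    ["software", "bug", "debugging", "software"],
]

dictionary = Dictionary(articles)
corpus = [dictionary.doc2bow(article) for article in articles]

# Map a token to its id, and an id back to its token
computer_id = dictionary.token2id.get("computer")
print(computer_id, dictionary.get(computer_id))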

You have access to the dictionary and corpus objects you created in the previous exercise, as well as Python's defaultdict and itertools to help with the creation of intermediate data structures for analysis.

  • defaultdict allows us to initialize a dictionary that assigns a default value to non-existent keys. By supplying int as the default factory, any non-existent key is automatically given a value of 0, which makes it ideal for storing word counts in this exercise.

  • itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. With this function, we can easily iterate through our corpus object, which is a list of lists (see the short sketch after this list).
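As a quick illustration of the two helpers, the toy fake_corpus below (a hypothetical list of lists of (word_id, word_count) tuples) shows how defaultdict(int) starts every missing key at 0 and how chain.from_iterable flattens the nested lists into one stream.

from collections import defaultdict
import itertools

# A list of lists standing in for a gensim corpus of (word_id, word_count) tuples
fake_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]

counts = defaultdict(int)  # missing keys default to 0
for word_id, word_count in itertools.chain.from_iterable(fake_corpus):
    counts[word_id] += word_count  # no KeyError, even the first time a word_id appears

print(dict(counts))  # {0: 3, 1: 1, 2: 3}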

The fifth document from corpus is saved in the variable doc in the sample code below; you'll sort it by word count in descending order.

This exercise is part of the course Introduction to Natural Language Processing in Python.


Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.____(____), ____)
    
# Create the defaultdict: total_word_count
total_word_count = ____
for word_id, word_count in itertools.chain.from_iterable(corpus):
    ____[____] += ____
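If you get stuck, here is one way the completed code could look. This is a sketch, not the official solution: it assumes the dictionary and corpus objects from the previous exercise (with at least five documents) and adds the standard-library imports that the exercise environment already provides for you.

from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the (word_id, word_count) tuples by count, descending: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside their counts
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

# Create the defaultdict and tally word counts across the whole corpus
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Print the top 5 words across all documents alongside their counts
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

dictionary.get(word_id) maps each numeric id back to its token, so the printed output is readable words rather than ids.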