Exercise

Gensim bag-of-words

Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!

You have access to the dictionary and corpus objects you created in the previous exercise, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

  • defaultdict allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we are able to ensure that any non-existent keys are automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.

  • itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
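
As a minimal sketch of the two helpers described above, the snippet below uses a made-up toy corpus (toy_corpus is hypothetical, not the exercise's corpus object) with the same list-of-lists shape, where each inner item is a (word_id, word_count) tuple.

```python
from collections import defaultdict
import itertools

# Hypothetical toy data: each document is a list of (word_id, word_count)
# tuples, mirroring the shape of a bag-of-words corpus.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
]

# defaultdict(int) calls int() for any missing key, so the default is 0.
counts = defaultdict(int)
missing_value = counts[42]
print(missing_value)

# chain.from_iterable() walks every document as one continuous sequence.
flat = list(itertools.chain.from_iterable(toy_corpus))
print(flat)
```

Note that reading a missing key from a defaultdict also stores that key with its default value, which is harmless when you are only accumulating counts.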

The fifth document from corpus is stored in the variable doc, which has been sorted in descending order of word count.

Instructions 1/2
  • Using the first for loop, print the top five words of bow_doc: look up each word_id in the dictionary and print the resulting word alongside its word_count.

    • The word for a given word_id can be looked up using the .get() method of dictionary.
  • Create a defaultdict called total_word_count in which the keys are all the token ids (word_id) and the values are the sums of their occurrences across all documents (word_count).

    • Remember to specify int when creating the defaultdict, and inside the second for loop, increment the total_word_count entry for each word_id by word_count.
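
The two steps above might be sketched as follows. This is not the exercise's own code: toy_dictionary and toy_corpus are hypothetical stand-ins, and a plain dict plays the role of the dictionary object because it also offers .get(). The sorting line that builds bow_doc is likewise an assumption about how the sorted document was produced.

```python
from collections import defaultdict
import itertools

# Hypothetical stand-ins for the exercise's dictionary and corpus objects.
toy_dictionary = {0: "computer", 1: "software", 2: "debugging", 3: "code"}
toy_corpus = [
    [(0, 4), (1, 2)],
    [(1, 1), (2, 5), (3, 2)],
    [(0, 1), (3, 3)],
]

# Assumption: bow_doc is one document sorted by word_count, highest first.
doc = toy_corpus[1]
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# First loop: look up each word_id and print the word with its count.
top_words = []
for word_id, word_count in bow_doc[:5]:
    word = toy_dictionary.get(word_id)
    top_words.append((word, word_count))
    print(word, word_count)

# Second loop: accumulate counts for every word_id across all documents.
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(toy_corpus):
    total_word_count[word_id] += word_count

print(dict(total_word_count))
```

Slicing with [:5] is safe even when a document has fewer than five distinct words, as in this toy data.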