1. Introduction to gensim
In this video, we will get started using a new tool called Gensim.
2. What is gensim?
**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks like building document or word vectors, corpora and performing topic identification and document comparisons.
3. What is a word vector?
You might be wondering what a word or document vector is? Here are some examples here in visual form. A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word or document.
You can think of it as a multi-dimensional array normally with sparse features (lots of zeros and some ones). With these vectors, we can then see relationships among the words or documents based on how near or far they are and also what similar comparisons we find.
For example, in this graphic we can see that the vector operation king minus queen is approximately equal to man minus woman. Or that Spain is to Madrid as Italy is to Rome. The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.
4. Gensim example
The graphic we have here is an example of LDA visualization. LDA stands for latent dirichlet allocation, and it is a statistical model we can apply to text using Gensim for topic analysis and modelling.
This graph is just a portion of a blog post written in 2015 using Gensim to analyze US presidential addresses. The article is really neat and you can find the link here.
5. Creating a gensim dictionary
Gensim allows you to build corpora and dictionaries using simple classes and functions. A corpus (or if plural, corpora) is a set of texts used to help perform natural language processing tasks.
Here, our documents are a list of strings that look like movie reviews about space or sci-fi films.
First we need to do some basic preprocessing. For brevity, we will only tokenize and lowercase. For better results, we would want to apply more of the preprocessing we have learned in this chapter, such as removing punctuation and stop words.
Then we can pass the tokenized documents to the Gensim Dictionary class. This will create a mapping with an id for each token. This is the beginning of our corpus. We now can represent whole documents using just a list of their token ids and how often those tokens appear in each document.
We can take a look at the tokens and their ids by looking at the token2id attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.
6. Creating a gensim corpus
Using the dictionary we built in the last slide, we can then create a Gensim corpus. This is a bit different than a normal corpus -- which is just a collection of documents. Gensim uses a simple bag-of-words model which transforms each document into a bag of words using the token ids and the frequency of each token in the document.
Here, we can see that the Gensim corpus is a list of lists, each list item representing one document.
Each document a series of tuples, the first item representing the tokenid from the dictionary and the second item representing the token frequency in the document. In only a few lines, we have a new bag-of-words model and corpus thanks to Gensim.
And unlike our previous Counter-based bag of words, this Gensim model can be easily saved, updated and reused thanks to the extra tools we have available in Gensim. Our dictionary can also be updated with new texts and extract only words that meet particular thresholds. We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises.
7. Let's practice!
Now you can get started building your own dictionary with Gensim!