1. Word vectors and spaCy
Welcome! Let's learn how to visualize word vectors and utilize them to find similar contexts.
2. Word vectors visualization
We can visualize word vectors in a scatter plot to help us understand how the vocabulary words are grouped.
In order to visualize word vectors, we need to project them into a two-dimensional space. We can project vectors by extracting the two principal components via Principal Component Analysis (PCA). We won't go into further details on PCA, but it is a way to reduce a high-dimensional dataset into a dataset of fewer dimensions (two in this case).
By applying PCA and projecting word vectors of words such as wonderful, horrible, apple, banana, orange, watermelon, dog, and cat in two-dimensional space, we see these words are grouped into three semantic classes (animals, fruits and emotional context). This suggests that word vectors capture at least part of the meaning of the words.
3. Word vectors visualization
Now that we have seen the projected word vectors, let us learn how to use matplotlib, spaCy and sklearn packages to create such a visualization.
First, we import the required libraries (matplotlib, PCA and numpy) and load a spaCy model.
Then we extract word vectors for a given list of words by using nlp-dot-vocab-dot-strings to look up the ID of each word and nlp-dot-vocab-dot-vectors to get the vector stored for that ID. We then stack these vectors vertically using the np-dot-vstack() function, ready for the PCA calculation.
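As a minimal sketch of this extraction step, the snippet below looks up word IDs and vectors and stacks them. It assumes a blank English pipeline filled with made-up three-dimensional toy vectors (via vocab.set_vector) so that it runs without downloading a model; with a pretrained model that ships vectors, the lookup and stacking lines would stay the same.

```python
import numpy as np
import spacy

# Assumption: a blank pipeline with hypothetical 3-dimensional toy vectors,
# standing in for a pretrained model that ships real word vectors.
nlp = spacy.blank("en")
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.0],
    "dog": [0.0, 0.1, 0.9],
    "cat": [0.1, 0.0, 0.8],
}
for word, values in toy_vectors.items():
    nlp.vocab.set_vector(word, np.asarray(values, dtype="float32"))

words = ["apple", "banana", "dog", "cat"]

# nlp.vocab.strings maps each word to its ID,
# and nlp.vocab.vectors maps that ID to the stored vector.
word_vectors = np.vstack(
    [nlp.vocab.vectors[nlp.vocab.strings[word]] for word in words]
)

print(word_vectors.shape)  # one row per word, one column per dimension
```

Each row of word_vectors now corresponds to one word, in the same order as the words list.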
4. Word vectors visualizations
Since the word vectors have 300 dimensions, we need to project them into two-dimensional space. We use the PCA class from sklearn and extract two principal components using the pca-dot-fit_transform() method.
We can then use these two components as the x and y coordinates of each word by accessing indices 0 and 1 of the transformed word vectors, and draw the plot using the plt-dot-text() and plt-dot-show() functions.
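The projection and plotting steps might look like the sketch below. The 300-dimensional vectors here are randomly generated stand-ins (an assumption made so the example runs on its own), grouped into three artificial clusters; in practice the stacked spaCy word vectors would take their place. The headless "Agg" backend is also an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

words = ["wonderful", "horrible", "apple", "banana",
         "orange", "watermelon", "dog", "cat"]

# Hypothetical stand-in for real word vectors: three random cluster centers
# (emotional context, fruits, animals) plus a little noise per word.
groups = np.array([0, 0, 1, 1, 1, 1, 2, 2])
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 300))
word_vectors = centers[groups] + 0.1 * rng.normal(size=(len(words), 300))

# Project the 300-dimensional vectors onto two principal components.
pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(word_vectors)

# Columns 0 and 1 are the x and y coordinates of each word.
plt.figure(figsize=(6, 4))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1])
for word, (x, y) in zip(words, word_vectors_2d):
    plt.text(x, y, word)
plt.show()
```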
5. Analogies and vector operations
Now into analogies and semantic understanding of the words! Word vectors can capture semantics and can also support vector operations, such as vector addition and subtraction.
A word analogy is a semantic relationship between a pair of words. There are many types of relationships, such as synonymy, antonymy, and whole-part relation. Some example pairs are (king - man, queen - woman) and (walked - walking, swam - swimming).
Word vectors can capture remarkable analogies, such as those of gender and tense. For example, we can represent the gender mapping between queen and king as queen - woman + man = king. We subtract woman from queen and add man instead, and we get king. This analogy then reads as queen is to king as woman is to man.
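To make the arithmetic concrete, here is a small sketch using hand-made three-dimensional vectors. These toy values are purely hypothetical (the dimensions loosely stand for royalty, gender and fruitiness) and are chosen so the analogy works out exactly; real word vectors only satisfy it approximately, which is why the result is matched to its nearest neighbor by cosine similarity.

```python
import numpy as np

# Hypothetical toy vectors: (royalty, gender, fruitiness).
vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# queen - woman + man
target = vectors["queen"] - vectors["woman"] + vectors["man"]

# Pick the closest word that was not part of the query itself.
candidates = [w for w in vectors if w not in {"queen", "woman", "man"}]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))

print(best)  # king
```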
6. Similar words in a vocabulary
Now let us learn how we can use spaCy and its vocabulary to find similar words to a given term or phrase, such as "covid".
For this purpose, we first extract covid's word vector using nlp-dot-vocab-dot-vectors and nlp-dot-vocab-dot-strings as discussed before and convert it to a NumPy array using the np-dot-asarray() function. Then we can use the nlp-dot-vocab-dot-vectors-dot-most_similar() method to search among the vectors of all the words in the vocabulary for the five most similar terms.
We can see similar words such as covid-19, corona and covi in the output. These words are commonly present in the same context as the word "covid" and are semantically similar.
The most_similar() method returns the word IDs of the most similar terms in the vocabulary by finding the word vectors that have the minimum distance to the word vector of covid. We then index nlp-dot-vocab-dot-strings with each returned word ID, placing the ID inside square brackets, to get the similar words themselves.
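The lookup described above can be sketched as follows. To keep the example runnable without a downloaded model, it assumes a blank pipeline with a handful of invented three-dimensional vectors, so the vocabulary and similarity values are illustrative only, and it asks for three neighbors rather than five. Note that the query word itself comes back as its own closest match.

```python
import numpy as np
import spacy

# Assumption: blank pipeline with hypothetical toy vectors instead of a
# pretrained model's vocabulary.
nlp = spacy.blank("en")
toy_vectors = {
    "covid": [1.0, 0.1, 0.0],
    "covid-19": [0.9, 0.2, 0.0],
    "corona": [0.8, 0.3, 0.1],
    "banana": [0.0, 1.0, 0.0],
    "dog": [0.0, 0.0, 1.0],
}
for term, values in toy_vectors.items():
    nlp.vocab.set_vector(term, np.asarray(values, dtype="float32"))

# Look up the word ID, then the vector stored for that ID.
word = "covid"
word_vector = nlp.vocab.vectors[nlp.vocab.strings[word]]

# most_similar expects a 2-D array of queries and returns
# (keys, best_rows, scores); keys holds the word IDs.
keys, best_rows, scores = nlp.vocab.vectors.most_similar(
    np.asarray([word_vector]), n=3
)

# Map each returned word ID back to its string.
similar_words = [nlp.vocab.strings[int(key)] for key in keys[0]]
print(similar_words)
```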
7. Let's practice!
Let's practice what we have learned!