1. Beyond n-grams: word embeddings
We have covered a lot of ground in the last four chapters. However, before we bid adieu, we will cover one advanced topic that has a large number of applications in NLP.
2. The problem with BoW and tf-idf
Consider the three sentences: 'I am happy', 'I am joyous' and 'I am sad'. If we were to compute the similarities, 'I am happy' and 'I am joyous' would receive the same score as 'I am happy' and 'I am sad', regardless of how we vectorize them. This is because 'happy', 'joyous' and 'sad' are treated as completely different words. However, we know that 'happy' and 'joyous' are more similar to each other than either is to 'sad'. This is something the vectorization techniques we've covered so far simply cannot capture.
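To make this concrete, here is a minimal sketch showing that a bag-of-words model scores both pairs identically. It assumes scikit-learn is installed; the CountVectorizer setup below is illustrative, not the exact code from the earlier chapters.

    # Bag-of-words cannot tell that 'joyous' is closer in meaning to 'happy' than 'sad' is
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = ["I am happy", "I am joyous", "I am sad"]
    bow = CountVectorizer().fit_transform(sentences)

    # 'I am happy' vs 'I am joyous' and 'I am happy' vs 'I am sad'
    print(cosine_similarity(bow[0], bow[1]))  # same score...
    print(cosine_similarity(bow[0], bow[2]))  # ...as this one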
3. Word embeddings
Word embedding is the process of mapping words into an n-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data; the techniques used to train them are beyond the scope of this course. However, once generated, these vectors can be used to discern how similar two words are to each other. Consequently, they can also be used to detect synonyms and antonyms. Word embeddings are also capable of capturing more complex relationships. For instance, they can be used to detect that the words king and queen relate to each other the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow. One last thing to note is that word embeddings are not trained on your data; they depend on the pre-trained spaCy model you're using and are independent of the size of your dataset.
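As an illustration of such relationships, here is a rough sketch of the classic king - man + woman analogy using spaCy vectors. It assumes the large English model (en_core_web_lg) is installed; the exact score will vary with the model you use.

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_lg")
    king, man, woman, queen = nlp("king man woman queen")

    # The vector king - man + woman should point roughly in the direction of queen
    target = king.vector - man.vector + woman.vector
    score = np.dot(target, queen.vector) / (np.linalg.norm(target) * np.linalg.norm(queen.vector))
    print(score)  # a relatively high cosine similarity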
4. Word embeddings using spaCy
Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the spaCy model and create the doc object for our string. Note that it is advisable to load one of the larger spaCy models when working with word vectors. This is because the en_core_web_sm model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the vector attribute. The truncated output is as shown.
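The code on this slide looks roughly like the sketch below, assuming the large English model (en_core_web_lg) is installed; the example string is illustrative.

    import spacy

    # Load a larger pre-trained model so that real word vectors are available
    nlp = spacy.load("en_core_web_lg")
    doc = nlp("I am happy")

    # Each token exposes a fixed-length vector (300 dimensions in this model)
    for token in doc:
        print(token.text, token.vector[:3])  # print only the first few dimensions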
5. Word similarities
We can compute how similar two words are to each other by using the similarity method of a spaCy token. Let's say we want to compute how similar 'happy', 'joyous' and 'sad' are to each other. We define a doc containing the three words. We then use a nested loop to calculate the similarity score between each pair of words. As expected, 'happy' and 'joyous' are more similar to each other than either is to 'sad'.
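A minimal sketch of that nested loop, reusing the nlp object loaded above (the words are from the slide; the exact scores depend on the model):

    doc = nlp("happy joyous sad")

    # Compare every pair of tokens using the similarity method
    for token1 in doc:
        for token2 in doc:
            print(token1.text, token2.text, token1.similarity(token2))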
6. Document similarities
spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create doc objects for the sentences. Like spaCy tokens, docs also have a similarity method. Therefore, we can compute the similarity between two docs as follows. As expected, 'I am happy' is more similar to 'I am joyous' than it is to 'I am sad'. Note that the similarity scores are high in both cases because the sentences share two of their three words, 'I' and 'am'.
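Here is a short sketch of that comparison, again assuming the nlp object loaded above:

    # Doc objects also have a similarity method, based on averaged word vectors
    sent1 = nlp("I am happy")
    sent2 = nlp("I am joyous")
    sent3 = nlp("I am sad")

    print(sent1.similarity(sent2))  # higher score
    print(sent1.similarity(sent3))  # lower, but still high: 'I' and 'am' are shared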
7. Let's practice!
With this, we come to the end of this lesson. Let's now practice our newfound skills in the last set of exercises.