
Beyond n-grams: word embeddings

1. Beyond n-grams: word embeddings

We have covered a lot of ground in the last four chapters. However, before we bid adieu, we will cover one advanced topic that has a wide range of applications in NLP.

2. The problem with BoW and tf-idf

Consider the three sentences "I am happy", "I am joyous" and "I am sad". If we were to compute pairwise similarities, "I am happy" and "I am joyous" would get the same score as "I am happy" and "I am sad", regardless of how we vectorize them. This is because 'happy', 'joyous' and 'sad' are treated as completely different words. However, we know that happy and joyous are more similar to each other than either is to sad. This is something that the vectorization techniques we've covered so far simply cannot capture.

3. Word embeddings

Word embedding is the process of mapping words into an n-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data; the techniques used to train them are beyond the scope of this course. However, once generated, these vectors can be used to discern how similar two words are to each other. Consequently, they can also be used to detect synonyms and antonyms. Word embeddings are also capable of capturing complex relationships. For instance, they can be used to detect that the words king and queen relate to each other in the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow. One last thing to note is that word embeddings are not trained on user data; they are determined by the pre-trained spaCy model you're using and are independent of the size of your dataset.
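To make the analogy concrete, here is a minimal sketch (not part of the course code) that checks whether the offset king - man + woman lands close to queen; the cosine helper and the en_core_web_md model name are assumptions for illustration.

import spacy
import numpy as np

# Assumes the larger en_core_web_md model is installed
# (python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

def cosine(u, v):
    # Cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Look up the pre-trained vector for each word in the model's vocabulary
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

# The offset king - man + woman should land relatively close to queen
print(cosine(king - man + woman, queen))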

4. Word embeddings using spaCy

Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the spaCy model and create the doc object for our string. Note that it is advisable to load a larger spaCy model while working with word vectors. This is because the en_core_web_sm model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the vector attribute. The truncated output is as shown.
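A minimal sketch of this pattern, assuming the larger en_core_web_md model is installed (the model name and example sentence are illustrative):

import spacy

# Load a larger pre-trained model that ships with true word vectors
nlp = spacy.load("en_core_web_md")

# Create the Doc object for our string
doc = nlp("I am happy")

# Each token exposes its embedding through the .vector attribute
for token in doc:
    print(token.text, token.vector[:5])  # print only the first few dimensions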

5. Word similarities

We can compute how similar two words are to each other by using the similarity method of a spaCy token. Let's say we want to compute how similar happy, joyous and sad are to each other. We define a doc containing the three words. We then use a nested loop to calculate the similarity score between each pair of words. As expected, happy and joyous are more similar to each other than either is to sad.
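A minimal sketch of the nested-loop computation, again assuming en_core_web_md:

import spacy

nlp = spacy.load("en_core_web_md")

# Doc containing the three words we want to compare
doc = nlp("happy joyous sad")

# Nested loop over every pair of tokens;
# similarity() returns a cosine similarity score between the word vectors
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))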

6. Document similarities

spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create doc objects for the sentences. Like spaCy tokens, docs also have a similarity method. Therefore, we can compute the similarity between two docs as follows. As expected, "I am happy" is more similar to "I am joyous" than it is to "I am sad". Note that the similarity scores are high in both cases because all sentences share two of their three words, 'I' and 'am'.
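A minimal sketch of document-level similarity, assuming en_core_web_md (the variable names are illustrative):

import spacy

nlp = spacy.load("en_core_web_md")

# Create Doc objects for the three sentences
sent_happy = nlp("I am happy")
sent_joyous = nlp("I am joyous")
sent_sad = nlp("I am sad")

# Doc.similarity() averages the word vectors of each document and compares them
print(sent_happy.similarity(sent_joyous))  # higher score
print(sent_happy.similarity(sent_sad))     # lower, but still high since 'I' and 'am' are shared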

7. Let's practice!

With this, we come to the end of this lesson. Let's now practice our newfound skills in the last set of exercises.