1. Introduction to word vectors
Welcome! Let's learn about word vectors.
2. Word vectors (embeddings)
Word vectors, or word embeddings, are numerical representations of words that allow computers to perform complex tasks using text data.
The purpose of word vectors is to allow a computer to understand words. Computers cannot understand text as is, but they can process numbers efficiently. For this reason, we'll convert words into numbers.
Traditional methods, such as the "bag-of-words" method, create word representations by mapping each word in a corpus to a unique number. These mappings are then stored in a dictionary where "I" can be mapped to one, "got" to two, and so on.
These older methods let a computer represent words numerically; however, they do not capture the meaning of the words.
Consider an example with two sentences: "I got covid" and "I got coronavirus".
With a bag-of-words model, these sentences are represented as the numerical arrays [1, 2, 3] and [1, 2, 4], respectively. The two sentences mean the same thing, yet they have different representations.
The computer has no way of knowing that the words "covid" and "coronavirus" refer to the same thing; the model just sees two different words represented by two different numbers. Hence, the model is oblivious to context and semantics.
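To make this concrete, here is a minimal Python sketch of this kind of integer encoding; the encode function and the word-to-ID dictionary are made up for illustration:

word_to_id = {}

def encode(sentence):
    # Assign each new word the next available ID; meaning is ignored.
    ids = []
    for word in sentence.lower().split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1
        ids.append(word_to_id[word])
    return ids

print(encode("I got covid"))        # [1, 2, 3]
print(encode("I got coronavirus"))  # [1, 2, 4]

The two encodings differ only in their last ID, and nothing in the numbers tells the computer that "covid" and "coronavirus" are synonyms.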
3. Word vectors
But all hope is not lost. More recent methods produce word vectors that can teach a computer whether two words have similar meanings.
Word vectors have a pre-defined number of dimensions. They are computed by statistical and machine learning models that take into account how often words appear in a corpus and which other words appear in similar contexts.
A computer can then use these vectors to measure the similarity of words numerically. For example, the table shows 7-dimensional word vectors that help distinguish animals from houses, or cats from dogs, by capturing different aspects of these words from their surrounding context in a large corpus of text.
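As a hedged illustration of how similarity becomes measurable, the sketch below uses made-up 3-dimensional vectors (real model vectors have far more dimensions, like the 7-dimensional ones in the table) and cosine similarity, one standard way to compare vectors:

import numpy as np

# Purely illustrative vectors; real models learn these from a corpus.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
house = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, dog))    # high: similar contexts
print(cosine_similarity(cat, house))  # low: different contexts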
4. Word vectors
There are multiple approaches to produce word vectors. Some of the most well-known algorithms are word2vec, GloVe, fastText, and transformer-based models.
To train n-dimensional word vectors, word2vec and fastText use neural network architectures, while GloVe uses a word co-occurrence matrix, and transformer-based models use more complex architectures. spaCy uses some of these methodologies to enable access to word vectors.
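As a rough sketch of the word2vec approach (using the gensim library, which is an assumption here and not part of this course's spaCy workflow), vectors can be trained on a toy corpus; the corpus and parameters below are made up:

from gensim.models import Word2Vec

# Tiny made-up corpus; real training needs millions of sentences.
sentences = [
    ["i", "got", "covid"],
    ["i", "got", "coronavirus"],
    ["covid", "and", "coronavirus", "are", "the", "same", "disease"],
]

# Train 50-dimensional vectors (gensim 4.x uses the vector_size argument).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

# Words appearing in similar contexts end up with similar vectors,
# though on a corpus this small the similarity score is unreliable.
print(model.wv.similarity("covid", "coronavirus"))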
5. spaCy vocabulary
Word vectors are part of many spaCy models; however, some models do not have word vectors.
For instance, the en_core_web_sm model, the small spaCy model, does not have any word vectors, while the medium-sized model, en_core_web_md, has 20,000 word vectors.
We can learn the size of the vocabulary and the word vector dimensions by checking the value of nlp-dot-meta for the key "vectors".
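For example, assuming the en_core_web_md model is installed, we can inspect this metadata like so (the numbers in the comment are illustrative):

import spacy

nlp = spacy.load("en_core_web_md")

# The "vectors" entry reports the vector width and vocabulary coverage.
print(nlp.meta["vectors"])
# e.g. {'width': 300, 'vectors': 20000, 'keys': ..., 'name': ...}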
6. Word vectors in spaCy
When using spaCy, we can only extract vectors of words that exist in a model's vocabulary.
We use the nlp-dot-vocab attribute of a spaCy model to access the vocabulary object. Then the nlp-dot-vocab-dot-strings attribute of this Vocab object can be used to access word IDs in the vocabulary.
Next, nlp-dot-vocab-dot-vectors can be used to access the word vector of a word using its ID.
For example, given the word "like", we first look up its ID in the vocabulary using nlp-dot-vocab-dot-strings["like"], then use this ID to access the corresponding word vector using nlp-dot-vocab-dot-vectors[extracted_word_id].
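Putting the two steps together, assuming the en_core_web_md model is installed:

import spacy

nlp = spacy.load("en_core_web_md")

# Map the word "like" to its ID (a hash value) in the vocabulary.
word_id = nlp.vocab.strings["like"]

# Use that ID to look up the word's vector.
word_vector = nlp.vocab.vectors[word_id]
print(word_vector.shape)  # (300,) for en_core_web_md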
7. Let's practice!
Great, let's practice what we've learned!