Embeddings
1. Embeddings
Let's explore embeddings, an advanced technique to represent words as numbers!

2. Limitations of BoW and TF-IDF
So far, we have used methods like Bag-of-Words and TF-IDF to convert text into numbers. However, these approaches have significant limitations: they treat similar words, such as "movie" and "film", as completely unrelated, so they fail to capture the true meaning of the text.
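To see this concretely, here is a minimal sketch (not part of the original lesson) using scikit-learn's CountVectorizer: two sentences that differ only in "movie" versus "film" get no credit for that shared meaning, because the two words occupy completely separate dimensions.

    # Minimal sketch: Bag-of-Words treats "movie" and "film" as unrelated dimensions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["I loved this movie", "I loved this film"]
    bow = CountVectorizer().fit_transform(docs)

    # Any similarity comes only from the shared words "loved" and "this";
    # "movie" and "film" contribute nothing, even though they are synonyms.
    print(cosine_similarity(bow[0], bow[1]))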
3. Embeddings
Embeddings solve this problem by representing each word with a numerical vector that captures its meaning.

4. Embeddings
To get these embeddings, the model starts by assigning random values to each word.

5. Embeddings
It then refines those vectors by predicting missing words in sentences, adjusting them based on word relationships in real text.

6. Embeddings
At the end, words appearing in similar contexts, like "movie" and "film", end up with similar representations. Note that we don't need to know how embeddings are generated to work with them; we just need to understand what they are.
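For the curious, here is a minimal sketch of that training idea using gensim's Word2Vec on a tiny toy corpus. The corpus and parameters are made up purely for illustration; real models are trained on billions of words.

    # Train tiny word embeddings with Word2Vec; "movie" and "film" share contexts,
    # so their vectors drift toward each other during training.
    from gensim.models import Word2Vec

    sentences = [
        ["i", "watched", "a", "great", "movie", "last", "night"],
        ["i", "watched", "a", "great", "film", "last", "night"],
        ["the", "movie", "was", "too", "long"],
        ["the", "film", "was", "too", "long"],
    ]

    # vector_size is the embedding dimension; window is the context size.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)
    print(model.wv.similarity("movie", "film"))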
7. Embeddings as GPS coordinates for words
Word embeddings are like GPS coordinates for language. Just as neighboring cities like Paris and Lyon are close on a map because they share a region, words like "film" and "movie" should be close in the embedding space because they share meaning.
8. Gensim
The Gensim library provides popular pretrained word embedding models, such as Word2Vec and GloVe, in various versions; the number in the model name indicates the vector size.
9. Loading an embedding model
To load a model, we import gensim.downloader as api and call api.load, specifying a model such as 'glove-wiki-gigaword-50'. Once the model is loaded, it behaves as a KeyedVectors object, allowing us to retrieve a word's embedding by passing the word as a key. Here we see the embedding of 'movie' as a vector of fifty numbers.
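In code, that looks like this; the first call to api.load downloads the vectors and caches them locally.

    # Load a pretrained 50-dimensional GloVe model through gensim's downloader.
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-50")  # behaves like a KeyedVectors object

    # Index the model with a word to get its embedding.
    print(model["movie"].shape)  # (50,)
    print(model["movie"][:5])    # first five values of the vector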
10. Computing similarity
We can measure word similarity by passing two words to model.similarity, which returns the cosine similarity of their embeddings; higher values mean more similar. 'Film' and 'movie' have a similarity score of about 0.93, meaning their embeddings point in nearly the same direction.
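With the same model loaded above, the call is simply:

    # Cosine similarity between the two words' embeddings.
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-50")
    print(model.similarity("film", "movie"))  # roughly 0.93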
11. Finding most similar words
We use model.most_similar to find the words closest to a given word, specifying how many results we want with the topn parameter. For 'movie', the top three similar words are 'movies', 'film', and 'films', each returned with its similarity score.
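For example, with the same GloVe model:

    # most_similar returns (word, similarity) pairs, best match first.
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-50")
    for word, score in model.most_similar("movie", topn=3):
        print(word, round(score, 2))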
12. Visualizing embeddings
To better understand embeddings, we need to visualize them, but high-dimensional data is difficult to view directly. Principal Component Analysis, or PCA, transforms high-dimensional vectors into 2D or 3D for visualization while preserving important patterns, much like flattening a globe into a map: some details are lost, but the big relationships remain clear.
13. Visualizing embeddings with PCA
We start by importing PCA from sklearn.decomposition, selecting the words we want to visualize, and extracting their embeddings. Next, we create a PCA model with n_components=2 and apply fit_transform to reduce the embeddings to two dimensions. We then use a scatter plot to display each word's coordinates and add labels using plt.annotate, providing the word and its position. The visualization shows that 'film' and 'movie' are close together, just like 'cat' and 'dog', and 'car' and 'bus'. This confirms that embeddings effectively capture word meanings by grouping related words in space.
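Putting those steps together, a sketch along these lines produces the plot described above; the exact word list is an assumption.

    # Project 50-dimensional embeddings to 2D with PCA and plot them.
    import gensim.downloader as api
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    model = api.load("glove-wiki-gigaword-50")
    words = ["movie", "film", "cat", "dog", "car", "bus"]
    embeddings = np.array([model[w] for w in words])

    # Reduce the vectors to two dimensions while keeping the main structure.
    reduced = PCA(n_components=2).fit_transform(embeddings)

    plt.scatter(reduced[:, 0], reduced[:, 1])
    for word, (x, y) in zip(words, reduced):
        plt.annotate(word, (x, y))
    plt.show()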
14. Comparison of embeddings
We can repeat the entire process with another model, such as 'word2vec-google-news-300'. Even though we use both models in the same way, similar words will still be close to each other, but their positions will differ. This variation is expected: each embedding model is trained differently, resulting in its own vector representations.
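For instance, swapping the model name is all it takes; note that 'word2vec-google-news-300' is a much larger download than the GloVe model.

    # Compare two pretrained models on the same word pair.
    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-50")
    word2vec = api.load("word2vec-google-news-300")

    # Both agree that the words are similar, even though the raw vectors
    # (and any 2D projection of them) differ between models.
    print(glove.similarity("film", "movie"))
    print(word2vec.similarity("film", "movie"))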
15. Let's practice!
Time to practice!