Using word embedding for machine translation

1. Using word embedding for machine translation

In the previous exercises you learned about Teacher Forcing. Now you will explore word embeddings and how they can be used in machine translation.

2. Introduction to word embeddings

Word embeddings are sometimes called word vectors. Word vectors are numerical representations that capture the meaning of words. The one-hot vectors you used earlier cannot capture the meaning of words. To understand word vectors, imagine a supermarket. Similar things, like salmon and chicken, will be placed in the refrigerated section, while cleaning items are placed far away from food. Word vectors behave similarly. For example, the similarity between the word vectors of "cat" and "dog" will be high, while "cat" and "window" will have a lower similarity.

3. Similarity between word vectors

Let's compute the cosine similarity between the vectors of cat, dog, and window. To do that, we will use a list of word vectors published by the Stanford NLP group, learned from a very large corpus. Cosine similarity gives a higher value to vectors that point in similar directions (for example, cat and dog), meaning that those words have similar meanings or semantics. Therefore, using word embeddings makes it easier for the model to capture the meaning of words, which improves performance.
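
To make this concrete, here is a minimal sketch of loading a few GloVe vectors and comparing them with cosine similarity. It assumes you have downloaded a GloVe file such as glove.6B.50d.txt from the Stanford NLP group; the file path and the helper functions are illustrative, not part of the course code.

```python
import numpy as np

def load_glove(path, words):
    """Load vectors only for the words we need from a GloVe text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts[0] in words:
                vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 means more similar)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = load_glove("glove.6B.50d.txt", {"cat", "dog", "window"})
print(cosine_similarity(vecs["cat"], vecs["dog"]))     # relatively high
print(cosine_similarity(vecs["cat"], vecs["window"]))  # noticeably lower
```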

4. Implementing embeddings for the encoder

Let's see how to use embeddings for machine translation. So far, you have implemented the input layer as a 3D input of shape batch size by sequence length by vocabulary size, where each word is represented as a one-hot vector.
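
For reference, a minimal sketch of what that one-hot input layer could look like in Keras; en_len and en_vocab are placeholder names for the source sequence length and vocabulary size.

```python
from tensorflow.keras.layers import Input

en_len, en_vocab = 15, 100   # placeholder sequence length and vocabulary size
# 3D one-hot input: (batch size, sequence length, vocabulary size)
encoder_inputs = Input(shape=(en_len, en_vocab))
```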

5. Implementing embeddings for the encoder

However, when using an embedding layer, words are kept as word indices. Therefore, the input layer will be a two-dimensional input of shape batch size by sequence length, where each value represents a word. These indices are passed to a Keras Embedding layer, which will learn the word vectors during the translation task.
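
With word indices, the same input becomes two-dimensional. A sketch under the same placeholder names:

```python
from tensorflow.keras.layers import Input

en_len = 15   # placeholder source sequence length
# 2D word-index input: (batch size, sequence length); each value is a word ID
encoder_inputs = Input(shape=(en_len,))
```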

6. Implementing embeddings for the encoder

The embedding layer converts the word indices to a three-dimensional output of shape batch size by sequence length by embedding size, captured in en_emb. When defining an embedding layer, you need to provide three arguments: the vocabulary size, the embedding size, here defined as 96, and the length of the input sequence. The embedding layer is a matrix which has a single vector for each word in the vocabulary. In this example, for a 100-word vocabulary, it will be a 100 by 96 matrix.
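
Here is a hedged sketch of defining such an embedding layer with the three arguments mentioned above. The vocabulary size of 100 and the embedding size of 96 come from the example; the sequence length of 15 is a placeholder.

```python
from tensorflow.keras.layers import Input, Embedding

en_vocab, emb_size, en_len = 100, 96, 15
encoder_inputs = Input(shape=(en_len,))   # 2D input of word indices
# The layer holds a (100, 96) weight matrix: one 96-dimensional vector per word
en_emb = Embedding(en_vocab, emb_size, input_length=en_len)(encoder_inputs)
print(en_emb.shape)   # (None, 15, 96): batch size x sequence length x embedding size
```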

7. Implementing the encoder with embedding

After the word indices are converted to word vectors through the embedding layer, everything else in the encoder is identical to what you have done so far: the output of the embedding layer is passed on to a GRU layer.
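
A sketch of that encoder, with placeholder layer sizes; the GRU's final state is kept so it can later initialize the decoder.

```python
from tensorflow.keras.layers import Input, Embedding, GRU

en_len, en_vocab, emb_size, hsize = 15, 100, 96, 48   # placeholder sizes
encoder_inputs = Input(shape=(en_len,))               # word indices
en_emb = Embedding(en_vocab, emb_size, input_length=en_len)(encoder_inputs)
# The GRU consumes the embedded sequence; its final state summarizes the sentence
_, encoder_state = GRU(hsize, return_state=True)(en_emb)
```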

8. Implementing the decoder with embedding

The decoder inputs need to be modified in the same way as the encoder's inputs. Note that you specify two different embedding layers for the encoder and the decoder, as they handle two different languages and will have different vocabulary sizes and sequence lengths.
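
Below is a hedged sketch of the full model with separate embedding layers for the two languages. All sizes and variable names are placeholders, and the exact names used in the course may differ.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GRU, Dense, TimeDistributed

en_len, en_vocab = 15, 100    # placeholder source-side sizes
fr_len, fr_vocab = 20, 120    # placeholder target-side sizes
emb_size, hsize = 96, 48

# Encoder: word indices -> embeddings -> GRU state
encoder_inputs = Input(shape=(en_len,))
en_emb = Embedding(en_vocab, emb_size, input_length=en_len)(encoder_inputs)
_, encoder_state = GRU(hsize, return_state=True)(en_emb)

# Decoder: its own embedding layer (different language, vocabulary, and length)
decoder_inputs = Input(shape=(fr_len - 1,))   # teacher-forced inputs (all but the last word)
de_emb = Embedding(fr_vocab, emb_size, input_length=fr_len - 1)(decoder_inputs)
decoder_out = GRU(hsize, return_sequences=True)(de_emb, initial_state=encoder_state)
decoder_pred = TimeDistributed(Dense(fr_vocab, activation='softmax'))(decoder_out)

nmt_emb = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_pred)
nmt_emb.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
```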

9. Training the model

This model uses both word embeddings and teacher forcing. During training, you go through multiple epochs and multiple iterations in each epoch. First you get the encoder inputs using the sents2seqs function. Make sure you set onehot to False, as you need the word IDs, not the one-hot vectors. Next you get the decoder inputs as a sequence of word IDs as well. Then you can create the decoder input by slicing the array so that all word IDs except the last are included; here, the time dimension is the last axis of the array. You then get the decoder targets, which will be one-hot encoded and should include all word IDs except the first one. Finally, as you have all the inputs and targets required, you can train the model by calling the train_on_batch function with the necessary inputs and outputs.
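
A sketch of that training loop. It assumes the nmt_emb model from the previous sketch, lists of source and target sentences en_text and fr_text, and the course-provided sents2seqs helper; the exact signature of sents2seqs shown here is an assumption.

```python
n_epochs, bsize = 5, 250   # placeholder number of epochs and batch size

for ei in range(n_epochs):
    for i in range(0, len(en_text), bsize):
        # Encoder inputs as word IDs (onehot=False)
        en_x = sents2seqs('source', en_text[i:i + bsize], onehot=False)
        # Decoder sequences as word IDs
        de_xy = sents2seqs('target', fr_text[i:i + bsize], onehot=False)
        # Decoder inputs: all word IDs except the last (time is the last axis)
        de_x = de_xy[:, :-1]
        # Decoder targets: one-hot encoded, all word IDs except the first
        de_xy_oh = sents2seqs('target', fr_text[i:i + bsize], onehot=True)
        de_y = de_xy_oh[:, 1:, :]
        # Train on this batch with teacher forcing
        nmt_emb.train_on_batch([en_x, de_x], de_y)
```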

10. Let's practice!

Great! Let's see word embeddings in action.