Neural Machine Translation
1. Neural Machine Translation
In the previous lessons you learned about text generation models; now you will explore how to create machine translation models.
2. Encoder and decoders
As seen before, neural machine translation models are divided into two parts: the encoder, which creates a language model for the input language, and the decoder, which does the same for the output language.
3. Encoder example
Let's create an example encoder. We use the Sequential class for the model. Then we add the embedding layer that will create word vectors for the input language. The parameter mask_zero reserves the index zero in the vocabulary for padding, so that padded positions are ignored. Next, we add one LSTM layer with 128 units. The next step is to repeat the last output of the LSTM layer as many times as the length of the output sentences, so that it can be used as input for the decoder part of the model. Note that RepeatVector is a Keras layer not introduced before.
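A minimal sketch of this encoder, assuming input_vocab_size, input_length and output_length are placeholders for the input vocabulary size, the padded input length and the output sentence length:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector

model = Sequential()
# Word vectors for the input language; mask_zero reserves index 0 for padding
# (input_vocab_size should account for this extra reserved index)
model.add(Embedding(input_vocab_size, 128,
                    input_length=input_length, mask_zero=True))
# Encode the whole input sentence into the LSTM's last 128-dimensional output
model.add(LSTM(128))
# Repeat that output once per output token, to be consumed by the decoder
model.add(RepeatVector(output_length))
```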
4. Decoder example
Continuing right after the encoder, we add one LSTM layer that will return the sequences. Then we use the TimeDistributed wrapper together with a dense layer to apply one dense layer to each unit of the previous layer. This is useful when we are comparing the whole sentence instead of only the final result, as in classification. With return_sequences equal to False, the loss function would be computed only on the last token; with return_sequences equal to True and a TimeDistributed layer, the loss function is applied to every token. This Keras layer is also newly introduced.
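Continuing the sketch above, with output_vocab_size as a placeholder for the output vocabulary size:

```python
from keras.layers import Dense, TimeDistributed

# Return one output vector per time step instead of only the last one
model.add(LSTM(128, return_sequences=True))
# Apply the same softmax Dense layer at every time step
model.add(TimeDistributed(Dense(output_vocab_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```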
5. Data prep
When preparing the data, as before, we need to transform the text into sequences of numerical indexes. We do that for both languages, in the encoder and decoder parts of the model. The encoder will transform the input language into sequences of numerical indexes. For the decoder, apart from transforming the texts into sequences of numerical indexes, we also need to one-hot encode each index to treat it as one class among the total vocabulary size of the output language.
6. Data preparation for the input language
To prepare the input language, we first import the required objects: the Tokenizer class from the keras.preprocessing.text module and the pad_sequences function from the keras.preprocessing.sequence module. Then we instantiate the tokenizer and fit it on the input texts, which are a list-like object containing the sentences in the input language. Then we transform the input sentences into sequences of numerical indexes using the tokenizer's texts_to_sequences method. Finally, we pad the texts by inserting zero tokens to the right.
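These steps might look like the following sketch, where input_texts and input_length are illustrative names:

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(input_texts)
# Each sentence becomes a list of numerical indexes
X = input_tokenizer.texts_to_sequences(input_texts)
# Insert zero tokens on the right up to a fixed length
X = pad_sequences(X, maxlen=input_length, padding='post')
```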
7. Tokenize the output language
The first step for the output language is the same as for the input language. We use a tokenizer and fit it on the output texts. Then we transform the output texts and pad the sequences of numerical indexes.
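In code, a sketch with output_texts and output_length as illustrative names:

```python
output_tokenizer = Tokenizer()
output_tokenizer.fit_on_texts(output_texts)
# Indexes for the output language, padded with zeros on the right
Y_seq = output_tokenizer.texts_to_sequences(output_texts)
Y_seq = pad_sequences(Y_seq, maxlen=output_length, padding='post')
```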
8. One-hot encode the output language
After transforming texts into sequences of numbers, we still need to one-hot encode each of the indexes. We create a temporary list. Then we loop over each sentence of numerical indexes and use the function to_categorical to transform each index in the sentence into a one-hot vector, with the output language's vocabulary size as the number of classes. Then we append the result to the temporary list. After the loop is completed, we transform this temporary list into a numpy array and reshape it to have three dimensions: number of sentences, sentence length and output language's vocabulary size.
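A sketch of this loop, assuming Y_seq holds the padded index sequences from the previous step and output_vocab_size is the output vocabulary size:

```python
import numpy as np
from keras.utils import to_categorical

ylist = []
for seq in Y_seq:
    # One one-hot vector of length output_vocab_size per token
    ylist.append(to_categorical(seq, num_classes=output_vocab_size))
# Shape: (number of sentences, sentence length, vocabulary size)
Y = np.array(ylist).reshape(len(Y_seq), output_length, output_vocab_size)
```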
9. Note on training and evaluating
To train the model, we just need to call the method fit and pass the training data contained in the variables X and Y. To evaluate a translation model, we can use the Bilingual Evaluation Understudy, or BLEU, metric. This metric is beyond the scope of this course; for details, check the nltk documentation for the submodule nltk.translate.bleu_score.
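For example (the epoch and batch values here are illustrative, not from the lesson):

```python
# X: padded input indexes, Y: one-hot encoded output tokens
model.fit(X, Y, epochs=10, batch_size=32)
```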
10. Let's practice!
NMT models can be very complex; here we introduced their main points. Let's see them in practice!