
Implementing the encoder

1. Implementing the encoder

Let's now learn how to implement the encoder of the machine translation model.

2. Understanding the data

But before that, let's explore and understand the dataset. The dataset consists of two lists of strings: one list contains the English sentences, and the other contains the corresponding French sentences. You can print a few of the sentences to see what they look like. Here, we are printing the first three sentences of the English and French datasets.
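A minimal sketch of that exploration, assuming the two lists are named en_text and fr_text (hypothetical names, not given in the transcript):

```python
# Assumption: en_text and fr_text are the two lists of sentence strings
for en_sent, fr_sent in zip(en_text[:3], fr_text[:3]):
    print("English:", en_sent)
    print("French: ", fr_sent)
```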

3. Tokenizing the sentences

Let's now look at some attributes of the dataset, such as the average length of the sentences (in words) and the size of the vocabulary. These attributes are required to define the input layer of the encoder. The first step in computing them is to tokenize the sentences. Tokenization is the process of extracting individual tokens, for example individual words. You can tokenize a sentence on the space character: use the split method of a Python string and pass in the space character as the delimiter. This returns a list of the words extracted from the sentence.
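To make the tokenization concrete, here is a small illustrative example (the sentence itself is made up):

```python
sentence = "new jersey is sometimes quiet during autumn"

# Split on the space character to get individual word tokens
words = sentence.split(" ")
print(words)
# ['new', 'jersey', 'is', 'sometimes', 'quiet', 'during', 'autumn']
```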

4. Computing the length of sentences

You can compute the average sentence length as follows. You iterate through the list of sentences, tokenizing each sentence into words. Then you compute the length of the resulting list of words using the len function. Though this looks like a complex set of operations, it can be done in a single line using Python's list comprehension syntax. You then use the np dot mean function to compute the average length.
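As a sketch, again assuming the English sentences live in a list named en_text (the hypothetical name used above):

```python
import numpy as np

# Tokenize each sentence and take the length of the resulting word list,
# all in a single list comprehension
sent_lengths = [len(sent.split(" ")) for sent in en_text]

# Average sentence length across the dataset
print(np.mean(sent_lengths))
```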

5. Computing the size of the vocabulary

To compute the size of the vocabulary, you again tokenize all the sentences and create a list containing every word in the dataset, called all_words. Then you convert this list to a set object. A set contains only the unique items of a list, so it will hold only the unique words of the vocabulary. Finally, you take the length of this set to get the size of the vocabulary.
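A sketch of that computation, still assuming the hypothetical en_text list:

```python
# Gather every word in the dataset into one list
all_words = []
for sent in en_text:
    all_words.extend(sent.split(" "))

# Converting to a set keeps only unique words;
# its length is the vocabulary size
vocab_size = len(set(all_words))
print(vocab_size)
```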

6. The encoder

Let's now use that information to implement the encoder. Remember that the encoder is built from a GRU model, where GRU stands for Gated Recurrent Unit. The GRU model moves from one input to the next sequentially, producing an output (and a state) at each time step. The state vector produced at time t becomes an input state to the model at time t plus 1.
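You can see this behavior with a toy GRU layer on random data (all sizes here are arbitrary and chosen only for illustration):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import GRU

# One sequence of 5 time steps, each a vector of 10 features
x = tf.random.normal((1, 5, 10))

# return_sequences=True gives the output at every time step;
# return_state=True additionally returns the final state
outputs, state = GRU(4, return_sequences=True, return_state=True)(x)

print(outputs.shape)  # (1, 5, 4): one output per time step
print(state.shape)    # (1, 4): the state after the last time step

# For a GRU, the last output and the final state are identical
print(np.allclose(outputs[:, -1, :], state))  # True
```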

7. Implementing the encoder with Keras

The encoder will be very similar to the model you implemented while learning about the GRU layer. Knowing the average number of words helps us define en_len, and the size of the vocabulary helps us define en_vocab. These are essential for defining the input layer, and you will pick values close to what you discovered by analyzing the dataset. Next, you define a GRU layer that returns its last state. This last state will later be passed to the decoder as an input. Note that, though the output and the state are identical for a GRU layer, you will treat them as two separate things, as they can differ in other sequential models. With that, you define a Keras model representing the encoder, whose input is the input layer and whose output is the state obtained from the GRU layer.
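Here is a minimal sketch of the encoder, assuming one-hot encoded inputs and the values from the model summary on the next slide (the variable names hsize, en_inputs, en_gru and en_state are illustrative):

```python
from tensorflow.keras.layers import Input, GRU
from tensorflow.keras.models import Model

en_len = 15     # sequence length, close to the average found earlier
en_vocab = 150  # vocabulary size, close to the size found earlier
hsize = 48      # number of hidden units in the GRU

# Input layer: each sentence is en_len words, one-hot encoded over en_vocab
en_inputs = Input(shape=(en_len, en_vocab))

# GRU layer that also returns its last state
en_gru = GRU(hsize, return_state=True)
en_out, en_state = en_gru(en_inputs)

# The encoder maps an input sentence to the GRU's last state
encoder = Model(inputs=en_inputs, outputs=en_state)
```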

8. Understanding the Keras model summary

You can also print a summary of the model you defined using the model dot summary function. You can see three columns in the output: the name and type of each layer, the shape of its output, and its number of parameters. For example, you can see that the sequence length is set to 15 while the input size is set to 150 for the input layer. The GRU layer has 48 hidden units and produces two outputs, each containing 48 values.
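Continuing the sketch above, printing the summary is a single call:

```python
# Prints one row per layer: name and type, output shape, parameter count
encoder.summary()
```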

9. Let's practice!

Now that you've learned about the data and the encoder, let's have some fun!
