1. Part 2: Preprocessing the text
In this lesson you will be introduced to a few more preprocessing techniques that need to be performed before feeding data to the machine translation model.
2. Adding special starting/ending tokens
You will first add a special starting and an ending token to the target language sentences.
You can use sos to indicate the beginning of a French sentence and eos to mark its end. This step is not essential for the current model, but it will be useful for the improved model you'll be implementing in the last chapter.
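For example, a minimal sketch of this step could look as follows; the sentences and the fr_sentences variable name are illustrative, not taken from the course data.

```python
# Illustrative French (target) sentences
fr_sentences = ["nous aimons les chiens", "il aime les chats"]

# Wrap every target sentence with the special starting and ending tokens
fr_sentences = ["sos " + sent + " eos" for sent in fr_sentences]

print(fr_sentences[0])  # "sos nous aimons les chiens eos"
```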
3. Padding the sentences
So far, all the sentences you worked with had a fixed length, that is, a fixed number of words. However, in real-world datasets this is never the case. A common method for dealing with this is to pad the sentences with a special token so that they all have the same length. This also means that sentences longer than the specified length will be truncated.
To do this, Keras provides the pad_sequences function found in the keras dot preprocessing dot sequence submodule.
Let's understand this function through an example. First let's convert the sentences to a list of sequences using the texts_to_sequences function and save the result to seqs. This will be used in the next slide.
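A minimal sketch of this step, assuming a tokenizer fitted on a few illustrative English sentences (the sentences and variable names are examples, not the course data):

```python
from keras.preprocessing.text import Tokenizer

# Illustrative English sentences of different lengths
en_sentences = ["we like dogs",
                "we like cats and dogs",
                "we really really like both cats and dogs very very much"]

# Fit a tokenizer and convert each sentence to a sequence of word IDs
tokenizer = Tokenizer()
tokenizer.fit_on_texts(en_sentences)
seqs = tokenizer.texts_to_sequences(en_sentences)

print(seqs[0])  # e.g. [1, 2, 3] -- one integer ID per word
```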
4. Padding the sentences
To bring these sentences to the same length, you can call the pad_sequences function. When calling pad_sequences, you need to set three important arguments. The first argument, "padding", which can be "pre" or "post", sets whether the beginning or the end of the sentences is padded. Next, "truncating", which can be "pre" or "post", sets the truncating style. Lastly, "maxlen" sets the length you want to pad or truncate to.
In our example we are asking the function to pad at the end or truncate from the end so that the final length is 12. The first sequence, since it does not have 12 words, has 0s added to the end. The last sentence, having more than 12 words, has been truncated.
The tokenizer will never allocate zero as a word ID, as it is used for special purposes like padding sequences.
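Under the same assumptions, the padding step sketched below uses the seqs list from the previous slide and a maxlen of 12:

```python
from keras.preprocessing.sequence import pad_sequences

# Pad with zeros at the end and truncate from the end,
# so that every sequence ends up with length 12
pad_seqs = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)

print(pad_seqs.shape)  # (number of sentences, 12)
print(pad_seqs[0])     # word IDs of the short sentence followed by trailing 0s
```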
5. Benefit of reversing sentences
Another preprocessing technique is reversing the word order of the source sentences. For example, the English sentence "we like dogs" is fed to the encoder as "dogs like we". This is better because it helps to make a strong connection between the encoder and the decoder: when the source sentence is reversed, the initial words of the two languages are closest to each other. For example, the words "we" and "nous" refer to the same thing, which in turn helps the optimization process of the model. But unlike the previous steps, this is a language-dependent operation, as in some languages the subject will not appear as the first word.
6. Reversing the sentences
To start, let's convert the sentences to padded sequences again.
pad_sequences will return an array whose shape is the number of sentences by the sequence length.
7. Reversing the sentences
Therefore, the time dimension is the second axis.
When reversing the sentences, the first axis remains the same, as indicated by the colon before the comma. To reverse the second axis, pass colon colon -1 for that dimension.
When reversing the encoder text, you also need to be careful about the type of padding you use. It is desirable for the final encoder input to be pre-padded with zeros, as this helps to make a strong connection with the decoder. You can fix truncating to "post" regardless of the reversing step. For padding, however, you should use post-padding, because you reverse the sentences afterwards, and post-padding becomes pre-padding once the text is reversed.
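Putting the last two slides together, a minimal sketch could look like this, again using the illustrative seqs and a maxlen of 12:

```python
from keras.preprocessing.sequence import pad_sequences

# Post-pad and post-truncate the encoder sequences first ...
pad_seqs = pad_sequences(seqs, padding='post', truncating='post', maxlen=12)

# ... then reverse the time axis: the first axis (sentences) is kept as-is,
# and ::-1 reverses the second axis, so the trailing zeros end up in front
rev_seqs = pad_seqs[:, ::-1]

print(pad_seqs[0])  # e.g. [1 2 3 0 0 0 0 0 0 0 0 0]
print(rev_seqs[0])  # e.g. [0 0 0 0 0 0 0 0 0 3 2 1]
```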
8. Let's practice!
Now, you are aware of all the preprocessing techniques required. Let's practice some of these techniques.