Introduction to Teacher Forcing

1. Introduction to Teacher Forcing

You will now learn a technique called Teacher Forcing, which is used to train machine translation models.

2. The previous machine translator model

In the previous model, the encoder takes in a sequence of one-hot encoded English words and produces a context vector. Then, the decoder repeats the context vector and consumes it as a sequence of inputs to produce a sequence of GRU outputs. Finally, the prediction layer consumes the sequence of GRU outputs to output French words as probability distributions.
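This previous model can be sketched in Keras roughly as follows. The sizes en_len, fr_len, en_vocab, fr_vocab and hsize are placeholder assumptions, not values from the course:

```python
from tensorflow.keras.layers import Input, GRU, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

en_len, fr_len = 15, 12        # assumed sequence lengths
en_vocab, fr_vocab = 150, 200  # assumed vocabulary sizes
hsize = 48                     # assumed GRU hidden size

# Encoder: consumes a sequence of one-hot encoded English words
en_inputs = Input(shape=(en_len, en_vocab))
context = GRU(hsize)(en_inputs)  # context vector

# Decoder: repeats the context vector and consumes it as a sequence of inputs
de_repeated = RepeatVector(fr_len)(context)
de_out = GRU(hsize, return_sequences=True)(de_repeated)

# Prediction layer: a probability distribution over French words at each step
de_pred = TimeDistributed(Dense(fr_vocab, activation='softmax'))(de_out)

model = Model(inputs=en_inputs, outputs=de_pred)
```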

3. Analogy: Training without Teacher Forcing

Let's now understand how you trained the model previously with an analogy. Initially, a teacher, or the encoder, tells a student that he needs to translate "I like cats" to French.

4. Analogy: Training without Teacher Forcing

Then the teacher goes away and the student, or the decoder, outputs the full translation. Say he produced "Je chats".

5. Analogy: Training without Teacher Forcing

After the student produces the full translation, the teacher, or the decoder targets, appears and says that the actual translation should have been "J'aime les chats", which is used to train the model. Let's now look at how the training is modified when using teacher forcing.

6. Analogy: Training with Teacher Forcing

As before, the teacher will say, translate "I like cats" to French.

7. Analogy: Training with Teacher Forcing

Then the student, or the decoder, will produce one word. As soon as the student produces the word, the teacher comes and says that the correct word should have been "J'aime". Unlike in the previous example, the teacher doesn't wait until the student produces the full translation, but guides the student at every step.

8. Analogy: Training with Teacher Forcing

Therefore, in the next step, the student might actually get the next word right.

9. Analogy: Training with Teacher Forcing

As you can see, teacher forcing helps the model learn more quickly, as there's more guidance during training, and leads to better performance.

10. The previous machine translator model

From the model's perspective, when teacher forcing is used, the decoder consumes the French words as input instead of consuming a repeated context vector. The French words are fed to the decoder so that each position of the decoder has a certain word as the input, and the target is the next word in the sequence. For example, when the word "J'aime" is the input word, the output will be "les".

11. Implementing the model with Teacher Forcing

In the implementation, the encoder will be identical to the one you implemented earlier and outputs a context vector. The decoder of the new model will have a new input layer. This input layer accepts a sequence of one-hot encoded French words. Note that the layer actually takes one word fewer than the full sequence length.
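A minimal sketch of this new decoder input layer, using the same placeholder sizes as before. Passing the context vector to the decoder GRU via initial_state is one common way to connect the two parts; treat it as an assumption here:

```python
from tensorflow.keras.layers import Input, GRU

en_len, fr_len = 15, 12        # assumed sequence lengths
en_vocab, fr_vocab = 150, 200  # assumed vocabulary sizes
hsize = 48                     # assumed GRU hidden size

# Encoder: identical to the earlier one, outputs a context vector
en_inputs = Input(shape=(en_len, en_vocab))
context = GRU(hsize)(en_inputs)

# New decoder input layer: one-hot French words, one timestep fewer than fr_len
de_inputs = Input(shape=(fr_len - 1, fr_vocab))

# Decoder GRU conditioned on the context vector (assumed via initial_state)
de_out = GRU(hsize, return_sequences=True)(de_inputs, initial_state=context)
```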

12. Inputs and outputs

Let's understand this through an example. Assume the French sentence "J'aime les chiens". The input words will be "J'aime" and "les", and the corresponding output words will be "les" and "chiens". As you can see, both inputs and outputs will have only two words even though there are three French words.
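The shifting above can be expressed with plain list slicing. This is an illustration on word lists; the variable names de_x and de_y are assumptions:

```python
sentence = ["J'aime", "les", "chiens"]

# Decoder inputs: all words except the last
de_x = sentence[:-1]

# Decoder outputs: all words except the first
de_y = sentence[1:]
```

Both de_x and de_y end up with two words, one fewer than the full three-word sentence.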

13. Implementing the model with Teacher Forcing

The decoder prediction layer is implemented the same as before: a Dense layer with fr_vocab nodes and a softmax activation, wrapped in a TimeDistributed layer.

14. Compiling the model

When defining the model, it will have two inputs, en_inputs and de_inputs, which are given as a list, and the output will be de_pred. Finally, you compile the model with a loss function, an optimizer and a metric.
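Putting the pieces together, the full teacher-forcing model might look like this. The placeholder sizes, the initial_state connection, and the particular loss, optimizer and metric choices are assumptions for illustration:

```python
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense
from tensorflow.keras.models import Model

en_len, fr_len = 15, 12        # assumed sequence lengths
en_vocab, fr_vocab = 150, 200  # assumed vocabulary sizes
hsize = 48                     # assumed GRU hidden size

# Encoder, identical to before
en_inputs = Input(shape=(en_len, en_vocab))
context = GRU(hsize)(en_inputs)

# Decoder with its own input layer (one timestep fewer)
de_inputs = Input(shape=(fr_len - 1, fr_vocab))
de_out = GRU(hsize, return_sequences=True)(de_inputs, initial_state=context)
de_pred = TimeDistributed(Dense(fr_vocab, activation='softmax'))(de_out)

# Two inputs given as a list, one output
model = Model(inputs=[en_inputs, de_inputs], outputs=de_pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```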

15. Preprocessing data

When preprocessing the data, first, the encoder inputs will be one-hot encoded and their order reversed. Next, you will get both decoder inputs and outputs at once, called de_xy. You can then get only the inputs using colon minus one on the time dimension of de_xy, which gives all words in the sequence except the last. Finally, get the outputs by setting the range as one colon on the time dimension, which returns all words except the first.
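These preprocessing steps can be sketched with NumPy slicing on arrays of shape (samples, timesteps, vocabulary). The random placeholder arrays below stand in for real one-hot encoded data, which is an assumption for illustration:

```python
import numpy as np

# Placeholder arrays standing in for one-hot encoded sentences
en_x = np.random.random((4, 15, 150))   # assumed (samples, en_len, en_vocab)
de_xy = np.random.random((4, 12, 200))  # assumed (samples, fr_len, fr_vocab)

# Reverse the encoder inputs on the time dimension
en_x_rev = en_x[:, ::-1, :]

# Decoder inputs: all words except the last
de_x = de_xy[:, :-1, :]

# Decoder outputs: all words except the first
de_y = de_xy[:, 1:, :]
```

Note that de_x and de_y both have one timestep fewer than de_xy, matching the decoder's shortened input layer.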

16. Let's practice!

Great, it's time to practice what you've learned.