1. Training the model with Teacher Forcing
Now let's learn how you can train the model you just defined.
2. Model training in detail
You briefly saw what happens during model training in the previous chapter. Let's discuss it in more detail. First, model training requires a loss function and an optimizer.
3. Model training in detail
The loss is computed from two things: the prediction probabilities generated by the model, which will be an output of size batch size by sequence length by French vocabulary size,
and the actual one-hot encoded French words, or targets, which will be of the same shape as the predictions.
Then, in this model, the cross-entropy loss is computed, which measures the difference between the targets and the predicted words.
Finally, the loss value is passed to an optimizer like "Adam", which minimizes this loss by changing the model parameters. The parameters are changed every time you call the train_on_batch() function you saw earlier.
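As a minimal sketch, assuming the Keras encoder-decoder model from the previous lesson is available as model, and that en_x, de_x, and de_y are already prepared, specifying the loss and optimizer could look like this:

    # Cross-entropy loss with the Adam optimizer
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

    # Each call to train_on_batch() performs one parameter update,
    # using the loss computed on this single batch
    model.train_on_batch([en_x, de_x], de_y)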
4. Training the model with Teacher Forcing
While training, you will be iterating through the full dataset multiple times, that is, multiple epochs, where in each epoch you visit all the data in the dataset as batches.
You can get the current batch of English and French sentences by taking the data in the range "i" to "i+bsize", denoted as i colon i+bsize. You will be one-hot encoding both source and target language words, and you will also reverse the order of the source words. Next, you will generate the decoder inputs and outputs from de_xy.
As you saw earlier, the decoder inputs will be all French words except the last, and the targets will be all French words except the first in each sequence. This is done by slicing the array de_xy. We will discuss slicing in more detail on the next slide.
Then you can call train_on_batch with the inputs, which is a list containing en_x and de_x, and the outputs, which is de_y. Finally, you can get the evaluation metrics by passing the same data to the evaluate function.
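Putting these steps together, the training loop could look roughly like this; the sents2seqs() helper, n_epochs, bsize, and the en_text and fr_text sentence lists are assumptions carried over from the earlier exercises:

    for ei in range(n_epochs):
        for i in range(0, len(en_text), bsize):
            # One-hot encode the current batch; source word order is reversed
            en_x = sents2seqs('source', en_text[i:i+bsize], onehot=True, reverse=True)
            de_xy = sents2seqs('target', fr_text[i:i+bsize], onehot=True)
            # Decoder inputs: all French words except the last
            de_x = de_xy[:, :-1, :]
            # Decoder targets: all French words except the first
            de_y = de_xy[:, 1:, :]
            # One parameter update on this batch
            model.train_on_batch([en_x, de_x], de_y)
            # Evaluate on the same batch to monitor loss and accuracy
            res = model.evaluate([en_x, de_x], de_y, batch_size=bsize, verbose=0)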
5. Array slicing in detail
de_xy contains all French words as one-hot vectors.
de_x should contain all words except the last. You can do that by slicing the array on the time dimension. When you set the range to colon minus one, that is, :-1, it says: get all the time steps except for the last.
Then you can create de_y by setting the range to 1 colon, that is, 1:, on the time dimension, which says: get all the time steps except for the first one. This is illustrated in the figure.
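As a small illustration with made-up shapes, here is what that slicing looks like in NumPy; the batch size of two, sequence length of five, and vocabulary size of four are hypothetical:

    import numpy as np

    # Hypothetical batch: 2 sentences, 5 time steps, vocabulary of 4 words
    de_xy = np.random.rand(2, 5, 4)

    de_x = de_xy[:, :-1, :]   # all time steps except the last -> shape (2, 4, 4)
    de_y = de_xy[:, 1:, :]    # all time steps except the first -> shape (2, 4, 4)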
6. Creating training and validation data
You already saw that using a validation set is a good idea, as it helps you detect and prevent overfitting. Overfitting is the phenomenon where the training accuracy keeps increasing but the validation accuracy starts dropping.
You will be splitting the data so that there will be 800 training sentences and 200 validation sentences. First, you will create a list of indices of all the data and shuffle the indices randomly.
Then make the first 800 indices training indices and the last 200 validation indices. The shuffling helps to avoid any potential bias while selecting data.
Finally, you can use list comprehension to get the English and French data corresponding to training indices, and assign the extracted data to tr_en and tr_fr respectively. You then do the same with the valid_inds and assign validation data to v_en and v_fr.
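A sketch of this split, assuming en_text and fr_text each contain 1000 sentences:

    import numpy as np

    # Shuffle the indices to avoid any selection bias
    inds = np.arange(len(en_text))
    np.random.shuffle(inds)
    train_inds, valid_inds = inds[:800], inds[800:]

    # List comprehensions extract the sentences for each split
    tr_en = [en_text[ti] for ti in train_inds]
    tr_fr = [fr_text[ti] for ti in train_inds]
    v_en = [en_text[vi] for vi in valid_inds]
    v_fr = [fr_text[vi] for vi in valid_inds]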
7. Training with validation
What does training with validation look like? First, you need to do the training part just as you did earlier. That is, go through multiple epochs and, within each epoch, multiple iterations. At each iteration you train the model on a batch of inputs and outputs.
Next, you apply the same transformations you applied to the training data to the validation set. That is, first you generate the one-hot encoded representations of the validation data, v_en_x and v_de_xy. Next, you create v_de_x, which contains all French words except for the last one, and v_de_y, which contains all French words except for the first.
Finally, you use v_en_x, v_de_x, and v_de_y to evaluate the model at the end of each epoch, obtaining the validation loss and validation accuracy.
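Combined with the training loop from before, one epoch could look roughly like this; again, sents2seqs(), n_epochs, and bsize are assumptions carried over from earlier:

    for ei in range(n_epochs):
        # Training part: iterate over the training set in batches
        for i in range(0, len(tr_en), bsize):
            en_x = sents2seqs('source', tr_en[i:i+bsize], onehot=True, reverse=True)
            de_xy = sents2seqs('target', tr_fr[i:i+bsize], onehot=True)
            de_x, de_y = de_xy[:, :-1, :], de_xy[:, 1:, :]
            model.train_on_batch([en_x, de_x], de_y)

        # Validation part: the same transformations, applied to the validation set
        v_en_x = sents2seqs('source', v_en, onehot=True, reverse=True)
        v_de_xy = sents2seqs('target', v_fr, onehot=True)
        v_de_x, v_de_y = v_de_xy[:, :-1, :], v_de_xy[:, 1:, :]

        # Evaluate once per epoch to get the validation loss and accuracy
        v_loss, v_acc = model.evaluate([v_en_x, v_de_x], v_de_y, batch_size=bsize, verbose=0)
        print("Epoch {} => Val Loss: {:.4f}, Val Acc: {:.4f}".format(ei + 1, v_loss, v_acc))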
8. Let's train!
Let's now train our model with teacher forcing.