1. Dense and TimeDistributed layers
In this lesson, you will learn about two important layers that will help you to implement the last part of the encoder-decoder based machine translation model: the Dense layer and the TimeDistributed layer.
2. Introduction to the Dense layer
A Dense layer can be used to implement a fully-connected layer of a neural network. The top of the decoder needs a Dense layer, as you need to predict the correct French word at each decoder position.
A Dense layer looks like what is shown here. It takes an input vector and produces an output vector, using weights and biases. On their own, however, the output values do not form a valid probability distribution. To make the output a valid probability distribution over the classes, a softmax activation is applied to it.
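To make this concrete, here is a minimal NumPy sketch of the computation a softmax Dense layer performs. The shapes and values are illustrative, not taken from the slides.

```python
import numpy as np

# A Dense layer computes z = x @ W + b using weights W and biases b,
# then softmax turns z into values that are positive and sum to 1.
# Illustrative shapes: 3 inputs mapped to 3 classes.
x = np.array([1.0, 6.0, 8.0])
W = np.random.randn(3, 3)
b = np.zeros(3)

z = x @ W + b                        # raw outputs (not yet probabilities)
probs = np.exp(z) / np.exp(z).sum()  # softmax
print(probs, probs.sum())            # a valid probability distribution
```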
3. Understanding the Dense layer
You can easily create a Dense layer using this syntax. The first argument says how many classes or labels you have, which is 3 in this example. For machine translation, this would be the size of the target vocabulary, that is, the size of the French vocabulary. Each class would then represent a single French word. You set the activation of the Dense layer to softmax to make the output a valid probability distribution.
You can provide custom weight and bias initializations to a Dense layer using the kernel_initializer and bias_initializer arguments as well.
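Putting both points together, here is a minimal sketch, assuming the tensorflow.keras API; the initializer choices are just examples.

```python
from tensorflow.keras.layers import Dense

# A Dense layer with 3 output classes and a softmax activation,
# so the outputs form a valid probability distribution.
dense = Dense(3, activation='softmax')

# The same layer with custom weight and bias initializations.
dense_custom = Dense(3, activation='softmax',
                     kernel_initializer='glorot_uniform',
                     bias_initializer='zeros')
```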
4. Inputs and outputs of the Dense layer
A Dense layer takes a batch size by input size array, for example, a 2 by 3 array.
Then it produces a batch size by number of classes array.
If you look at the output, it resembles a valid probability distribution over the classes for each input. For example, the input 1 6 8 has the probability distribution 0.1, 0.3, 0.4 and 0.2, which sums to 1.
Finally, you can obtain the predicted class for each input using the np.argmax function with the last axis, represented by -1. This is because the probabilities lie along the last axis.
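As a minimal sketch of these shapes, assuming a Dense layer with 4 classes to match the 4-valued distribution above; the exact probabilities depend on the randomly initialized weights.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# A batch size by input size array: here, a 2 by 3 array.
x = np.array([[1, 6, 8],
              [2, 5, 7]], dtype='float32')

# A Dense layer mapping 3 inputs to 4 classes.
inputs = Input(shape=(3,))
outputs = Dense(4, activation='softmax')(inputs)
model = Model(inputs=inputs, outputs=outputs)

# y has shape (batch size, number of classes): here (2, 4),
# and each row sums to 1.
y = model.predict(x)

# Predicted class per input: argmax along the last axis (-1).
classes = np.argmax(y, axis=-1)
print(classes.shape)  # (2,)
```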
5. Understanding the TimeDistributed layer
However, as you have seen already, the output of the decoder GRU layer is a time-series input of size batch size by sequence length by input size.
To enable the Dense layer to process a time series input, like the output of the decoder GRU layer, you can use a TimeDistributed layer wrapper. You can easily add a TimeDistributed layer as follows.
Then you can create a model which can process a time-series input and produce predictions.
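A minimal sketch of this, assuming tensorflow.keras; the sequence length (3), input size (2), and GRU size (10) are illustrative.

```python
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense
from tensorflow.keras.models import Model

# Time-series input: (sequence length, input size) = (3, 2).
inputs = Input(shape=(3, 2))

# A decoder-style GRU that returns its output at every time step.
gru_out = GRU(10, return_sequences=True)(inputs)

# TimeDistributed applies the same softmax Dense layer to each time step.
outputs = TimeDistributed(Dense(3, activation='softmax'))(gru_out)

model = Model(inputs=inputs, outputs=outputs)
```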
6. Inputs and outputs of the TimeDistributed layer
A time-distributed Dense layer takes a batch size by sequence length by input size array and produces a batch size by sequence length by number of classes array. In the example here, you have a 2 by 3 by 2 input array.
The output has a probability distribution for each sample in the time-series input. You can see that each input in x is transformed to a valid probability distribution over the 3 classes, resulting in a 2 by 3 by 3 array. For example, the input 1 6 is transformed to 0.1, 0.5 and 0.4.
You can get the predicted class for each sample similar to the Dense layer, that is, by using np.argmax and providing the last axis, -1, as the axis.
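A minimal end-to-end sketch of these shapes, using the same kind of model as above; the input values are illustrative and the probabilities again depend on the random weights.

```python
import numpy as np
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense
from tensorflow.keras.models import Model

# GRU followed by a time-distributed Dense, as sketched earlier.
inputs = Input(shape=(3, 2))
gru_out = GRU(10, return_sequences=True)(inputs)
outputs = TimeDistributed(Dense(3, activation='softmax'))(gru_out)
model = Model(inputs=inputs, outputs=outputs)

# x: batch size by sequence length by input size = 2 by 3 by 2.
x = np.array([[[1, 6], [2, 5], [3, 4]],
              [[4, 3], [5, 2], [6, 1]]], dtype='float32')

# y: batch size by sequence length by number of classes = 2 by 3 by 3;
# every time step's row is a probability distribution over 3 classes.
y = model.predict(x)
print(y.shape)  # (2, 3, 3)

# Predicted class for each sample at each time step.
classes = np.argmax(y, axis=-1)
print(classes.shape)  # (2, 3)
```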
7. Slicing data on the time dimension
You can then iterate through time-distributed data using two for loops, as shown in the sketch below. In the first for loop, you iterate through the time dimension, which has size 3. In the second for loop, you obtain the t-th slice along the time dimension for both y and classes and print the data. Note that this is only one solution; there are multiple ways to do this.
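One possible version of this pattern, reusing the y (shape 2 by 3 by 3) and classes (shape 2 by 3) arrays from the sketch above:

```python
# First loop: iterate through the time dimension, which has size 3.
for t in range(3):
    # Second loop: take the t-th slice on the time dimension of both
    # y and classes, and print each sample's data at this time step.
    for probs, cls in zip(y[:, t, :], classes[:, t]):
        print("t =", t, "probabilities:", probs, "class:", cls)
```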
8. Let's practice!
Now you know how to use Dense and TimeDistributed layers. Let's practice!