GRU and LSTM cells
1. GRU and LSTM cells
In this lesson you will learn about two different RNN cells that achieve good results in language modeling and solve the vanishing gradient problem.
2. SimpleRNN cell in detail
Let's first have a detailed look at the SimpleRNN cell. At every step, the cell computes the new memory state based on the previous memory state a_t minus one and the current input word x_t. In the computations, we have a weight matrix W_a that is shared between all steps. We will consider the case of classification tasks, and thus the output y hat will be computed only in the last step.
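As a rough illustration of these computations, here is a minimal numpy sketch of one SimpleRNN step. The tanh activation, the concatenation of a_t minus one with x_t, and all shapes and variable names are illustrative assumptions, not the exact keras implementation.

```python
import numpy as np

# A minimal sketch of one SimpleRNN step, assuming a tanh activation
# and illustrative sizes (hidden size 4, embedding size 3).
def simple_rnn_step(a_prev, x_t, W_a, b_a):
    # Concatenate previous memory state and current input word vector,
    # then apply the shared weight matrix W_a and the tanh non-linearity.
    concat = np.concatenate([a_prev, x_t])
    return np.tanh(W_a @ concat + b_a)

hidden_size, embed_size = 4, 3
W_a = np.random.randn(hidden_size, hidden_size + embed_size)
b_a = np.zeros(hidden_size)

a = np.zeros(hidden_size)                    # initial memory state
for x_t in np.random.randn(5, embed_size):   # 5 input word vectors
    a = simple_rnn_step(a, x_t, W_a, b_a)

# For classification, y hat would be computed from the last state only,
# e.g. y_hat = softmax(W_y @ a + b_y).
```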
3. GRU cell
GRU cells were proposed in 2014 and add one gate to the vanilla RNN cell. Before updating the memory cell, we first compute a candidate a tilde that will carry the present information. Then we compute the update gate g_u, which determines whether the candidate a tilde will be used as the memory state or whether we keep the past memory state a_t minus one. If the gate is zero, the network keeps the previous hidden state, and if it is equal to one it uses the new value a tilde. Intermediate gate values mix the previous and candidate memory states, but during training the gate tends to get close to zero or one.
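Here is a minimal numpy sketch of this update, following only the update gate described above (a full GRU also has a reset gate, which is omitted here to mirror the text). The weight names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A minimal sketch of one GRU step using only the update gate.
def gru_step(a_prev, x_t, W_a, b_a, W_u, b_u):
    concat = np.concatenate([a_prev, x_t])
    a_tilde = np.tanh(W_a @ concat + b_a)   # candidate carrying present information
    g_u = sigmoid(W_u @ concat + b_u)       # update gate, between 0 and 1
    # g_u close to 1 -> use the candidate; close to 0 -> keep the previous state
    return g_u * a_tilde + (1 - g_u) * a_prev
```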
4. LSTM cell
LSTM was first proposed in 1997 and adds three gates to the vanilla RNN cell. The forget gate g_f determines whether the previous state c_t minus one should be forgotten (meaning its value is set to zero) or not. The update gate g_u does the same for the candidate hidden state c tilde. The output gate g_o does the same for the new hidden state c_t. The green circles on the picture represent the gates. We can think of them as open or closed gates, allowing the value on the left to pass through when the gate value is 1, or blocking it when the gate value is 0.
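The following numpy sketch shows one LSTM step with the three gates named above. The sigmoid and tanh activations, the weight names, and the way the hidden state a_t is derived from c_t are illustrative assumptions based on the standard LSTM formulation, not the exact keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A minimal sketch of one LSTM step with forget, update, and output gates.
def lstm_step(a_prev, c_prev, x_t,
              W_f, b_f, W_u, b_u, W_o, b_o, W_c, b_c):
    concat = np.concatenate([a_prev, x_t])
    g_f = sigmoid(W_f @ concat + b_f)       # forget gate
    g_u = sigmoid(W_u @ concat + b_u)       # update gate
    g_o = sigmoid(W_o @ concat + b_o)       # output gate
    c_tilde = np.tanh(W_c @ concat + b_c)   # candidate memory state
    c_t = g_f * c_prev + g_u * c_tilde      # forget part of the past, add new info
    a_t = g_o * np.tanh(c_t)                # new hidden state passed to the next step
    return a_t, c_t
```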
5. No more vanishing gradients
Because GRU and LSTM cells add gates to the equations, the gradients no longer depend only on the memory cell state. The derivative of the loss function with respect to the weight matrix depends on all the gates and on the memory cell, summing each of these parts. Without going deeper into the math, this architecture adds up the different gradients (corresponding to the gradients of each gate and of the memory state), which prevents the total gradient from converging to zero or diverging. At every step, if the gradient is exponentially increasing or decreasing, we expect the training phase to adjust the value of the corresponding gate to stop this vanishing or exploding tendency.
6. Usage in keras
Without further discussing the intuition and the theory, let's put the new RNN cells into practice in keras. First, the layers with the GRU and LSTM cells are available in the keras dot layers dot recurrent module, with a shortcut on keras dot layers. To use the GRU and LSTM cells in a keras model, we simply add them as usual. The important parameters are the number of units, meaning the number of memory cells to keep track of, and the return sequences parameter, which is used when stacking more than one layer in sequence, making all the cells emit an output that is fed to the next layer as input.
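For example, a small text classification model could look like the sketch below. The vocabulary size, embedding size, unit counts, input length, and the final Dense layer are placeholder choices for illustration, not values from the lesson.

```python
from keras.models import Sequential
from keras.layers import Embedding, GRU, LSTM, Dense

# Placeholder sizes: vocabulary of 10000 words, 32-dim embeddings,
# sequences of length 100, 64 memory cells per recurrent layer.
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))
# return_sequences=True makes every cell emit an output,
# which is fed to the next recurrent layer as its input sequence.
model.add(GRU(units=64, return_sequences=True))
# The last recurrent layer returns only the final output, used for classification.
model.add(LSTM(units=64))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```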
7. Let's practice!
Now that you understand how to differentiate the RNN cells, let's practice using them in keras!