Padding sentences
You will now implement a function called sents2seqs(), which you will later use to conveniently transform data into the format accepted by the neural machine translation (NMT) model. sents2seqs() accepts a list of sentence strings and:
- Converts the sentences to a list of sequences of word IDs,
- Pads the sequences so that they all have equal length, and
- Optionally converts the word IDs to one-hot vectors.
You have been provided with en_tok, a Tokenizer already trained on the data. Also note that sents2seqs() takes an argument called input_type that goes unused in this exercise. Later, input_type will be used to switch language-dependent parameters such as the sequence length and the vocabulary size.
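To make the role of en_tok concrete, here is a minimal sketch with a toy Tokenizer standing in for it (the real en_tok is trained on the course data): fit_on_texts() learns a word-to-ID mapping, and texts_to_sequences() applies it.

from tensorflow.keras.preprocessing.text import Tokenizer

# Toy stand-in for en_tok; the exercise provides one trained on real data
tok = Tokenizer()
tok.fit_on_texts(['she likes peaches', 'he likes lemons'])
# Each word maps to the integer ID learned during fitting
print(tok.texts_to_sequences(['she likes lemons']))  # e.g. [[2, 1, 5]]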
Exercise instructions
- Convert the sentences to sequences using the en_tok Tokenizer.
- Pad the sequences to a fixed length of en_len with a padding type of pad_type and post-truncating (see the padding sketch after this list).
- Convert the preproc_text word IDs to one-hot vectors of length en_vocab using the to_categorical() function.
- Convert sentence to a padded sequence using the sents2seqs() function with pre-padding.
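Before filling in the blanks, it may help to see how pad_sequences() treats the padding and truncating arguments; here is a minimal sketch with made-up word-ID sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[5, 2, 9], [7, 1]]
# post-padding appends zeros; pre-padding prepends them
print(pad_sequences(seqs, maxlen=4, padding='post'))  # [[5 2 9 0] [7 1 0 0]]
print(pad_sequences(seqs, maxlen=4, padding='pre'))   # [[0 5 2 9] [0 0 7 1]]
# post-truncating drops IDs from the end of sequences longer than maxlen
print(pad_sequences(seqs, maxlen=2, truncating='post'))  # [[5 2] [7 1]]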
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # Convert sentences to sequences of word IDs
    encoded_text = ____.____(sentences)
    # Pad the sequences to a fixed length of en_len
    preproc_text = ____(____, padding=____, truncating=____, maxlen=____)
    if onehot:
        # Convert the word IDs to one-hot vectors
        preproc_text = ____(____, num_classes=____)
    return preproc_text

sentence = 'she likes grapefruit , peaches , and lemons .'
# Convert the sentence to a sequence with pre-padding
pad_seq = sents2seqs('source', [____], pad_type=____)
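For reference, one possible completed version is sketched below. It is self-contained: en_tok, en_len and en_vocab are stand-ins defined on a toy corpus here (in the exercise environment they are provided and trained on the real data), so the snippet runs on its own.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Stand-ins for the exercise environment; the course provides these
en_tok = Tokenizer()
en_tok.fit_on_texts(['she likes grapefruit , peaches , and lemons .'])
en_len = 12      # assumed fixed sequence length
en_vocab = 50    # assumed vocabulary size

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # Convert sentences to sequences of word IDs
    encoded_text = en_tok.texts_to_sequences(sentences)
    # Pad (or post-truncate) every sequence to exactly en_len IDs
    preproc_text = pad_sequences(encoded_text, padding=pad_type,
                                 truncating='post', maxlen=en_len)
    if onehot:
        # Convert the word IDs to one-hot vectors of length en_vocab
        preproc_text = to_categorical(preproc_text, num_classes=en_vocab)
    return preproc_text

sentence = 'she likes grapefruit , peaches , and lemons .'
# Convert the sentence to a pre-padded sequence of word IDs
pad_seq = sents2seqs('source', [sentence], pad_type='pre')
print(pad_seq)  # padding zeros appear at the front with pre-padding

Note that truncating is fixed to 'post' while the padding side follows pad_type, matching the instructions above.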