Padding sentences
You will now implement a function called sents2seqs(), which you will later use to conveniently transform data into the format accepted by the neural machine translation (NMT) model. sents2seqs() accepts a list of sentence strings and:
- Converts the sentences to a list of sequences of word IDs,
- Pads the sequences so that they all have equal length, and
- Optionally converts the word IDs to one-hot vectors.
You have been provided with en_tok, a Tokenizer already trained on the data. Also note that sents2seqs() takes an argument called input_type that goes unused in this exercise. Later, input_type will be used to switch language-dependent parameters such as the sequence length and the vocabulary size.
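To make the role of en_tok concrete, here is a minimal sketch with a toy Tokenizer standing in for it (the real en_tok is trained on the course data): fit_on_texts() learns a word-to-ID mapping, and texts_to_sequences() applies it.

from tensorflow.keras.preprocessing.text import Tokenizer

# Toy stand-in for en_tok; the exercise provides one trained on real data
tok = Tokenizer()
tok.fit_on_texts(['she likes peaches', 'he likes lemons'])
# Each word maps to the integer ID learned during fitting
print(tok.texts_to_sequences(['she likes lemons']))  # e.g. [[2, 1, 5]]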
Exercise instructions
- Convert the sentences to sequences using the en_tok Tokenizer.
- Pad the sequences to a fixed length of en_len with a padding type of pad_type and post-truncating (see the padding sketch after this list).
- Convert the preproc_text word IDs to one-hot vectors of length en_vocab using the to_categorical() function.
- Convert sentence to a padded sequence using the sents2seqs() function with pre-padding.
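Before filling in the blanks, it may help to see how pad_sequences() treats the padding and truncating arguments; here is a minimal sketch with made-up word-ID sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[5, 2, 9], [7, 1]]
# post-padding appends zeros; pre-padding prepends them
print(pad_sequences(seqs, maxlen=4, padding='post'))  # [[5 2 9 0] [7 1 0 0]]
print(pad_sequences(seqs, maxlen=4, padding='pre'))   # [[0 5 2 9] [0 0 7 1]]
# post-truncating drops IDs from the end of sequences longer than maxlen
print(pad_sequences(seqs, maxlen=2, truncating='post'))  # [[5 2] [7 1]]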
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # Convert sentences to sequences of word IDs
    encoded_text = ____.____(sentences)
    # Pad the sequences to a fixed length of en_len
    preproc_text = ____(____, padding=____, truncating=____, maxlen=____)
    if onehot:
        # Convert the word IDs to one-hot vectors
        preproc_text = ____(____, num_classes=____)
    return preproc_text

sentence = 'she likes grapefruit , peaches , and lemons .'
# Convert the sentence to a sequence with pre-padding
pad_seq = sents2seqs('source', [____], pad_type=____)
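For reference, one possible completed version is sketched below. It is self-contained: en_tok, en_len and en_vocab are stand-ins defined on a toy corpus here (in the exercise environment they are provided and trained on the real data), so the snippet runs on its own.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Stand-ins for the exercise environment; the course provides these
en_tok = Tokenizer()
en_tok.fit_on_texts(['she likes grapefruit , peaches , and lemons .'])
en_len = 12      # assumed fixed sequence length
en_vocab = 50    # assumed vocabulary size

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # Convert sentences to sequences of word IDs
    encoded_text = en_tok.texts_to_sequences(sentences)
    # Pad (or post-truncate) every sequence to exactly en_len IDs
    preproc_text = pad_sequences(encoded_text, padding=pad_type,
                                 truncating='post', maxlen=en_len)
    if onehot:
        # Convert the word IDs to one-hot vectors of length en_vocab
        preproc_text = to_categorical(preproc_text, num_classes=en_vocab)
    return preproc_text

sentence = 'she likes grapefruit , peaches , and lemons .'
# Convert the sentence to a pre-padded sequence of word IDs
pad_seq = sents2seqs('source', [sentence], pad_type='pre')
print(pad_seq)  # padding zeros appear at the front with pre-padding

Note that truncating is fixed to 'post' while the padding side follows pad_type, matching the instructions above.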