Machine Translation with Keras

Exercise

Padding sentences

You will now implement a function called sents2seqs(), which you will later use to conveniently transform data into the format accepted by the neural machine translation (NMT) model. sents2seqs() accepts a list of sentence strings and,

  • Converts the sentences to a list of sequences of word IDs,
  • Pads the sequences so that they all have equal length, and
  • Optionally converts the IDs to onehot vectors.

You have been provided with en_tok, a Tokenizer already trained on data. Also note that, when implementing the sents2seqs() function, you will see a so-far-unused argument called input_type. Later, input_type will be used to select language-dependent parameters such as the sequence length and the vocabulary size.

Instructions

  • Convert the sentences to sequences using the en_tok Tokenizer.
  • Pad sequences to a fixed en_len length with a specified padding type of pad_type and use post-truncating.
  • Convert the preproc_text word IDs to onehot vectors of length en_vocab using the to_categorical() function.
  • Convert a sentence to a padded sequence by calling sents2seqs() with pre-padding.
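The steps above can be sketched as follows. This is a framework-free illustration of the same logic using only NumPy: the real exercise uses the trained en_tok Tokenizer's texts_to_sequences(), Keras' pad_sequences() (with padding=pad_type and truncating='post'), and to_categorical(). The toy word2id mapping and the concrete values of en_len and en_vocab below are hypothetical stand-ins, not values from the course.

```python
import numpy as np

en_len = 5    # fixed sequence length (toy value; provided by the course)
en_vocab = 6  # vocabulary size, incl. ID 0 for padding (toy value)

# Hypothetical stand-in for the trained en_tok Tokenizer's word index
word2id = {"i": 1, "like": 2, "cats": 3, "dogs": 4, "very": 5}

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # input_type is unused for now; later it selects language-dependent
    # parameters such as sequence length and vocabulary size.
    # 1) Convert sentences to lists of word IDs
    #    (en_tok.texts_to_sequences in the exercise)
    encoded = [[word2id[w] for w in s.lower().split()] for s in sentences]
    # 2) Pad/truncate every sequence to en_len
    #    (pad_sequences(..., padding=pad_type, truncating='post'))
    preproc_text = np.zeros((len(encoded), en_len), dtype=int)
    for i, seq in enumerate(encoded):
        seq = seq[:en_len]                        # post-truncate long inputs
        if pad_type == 'post':
            preproc_text[i, :len(seq)] = seq      # zeros appended at the end
        else:
            preproc_text[i, en_len - len(seq):] = seq  # 'pre': zeros in front
    # 3) Optionally convert IDs to onehot vectors of length en_vocab
    #    (to_categorical in the exercise)
    if onehot:
        preproc_text = np.eye(en_vocab, dtype=int)[preproc_text]
    return preproc_text

# Convert a sentence to a padded sequence using pre-padding
padded = sents2seqs('source', ["I like cats"], pad_type='pre')
print(padded)  # [[0 0 1 2 3]]
```

With pre-padding the zeros go in front of the IDs; with onehot=True the output gains a trailing dimension of size en_vocab, one indicator vector per word ID.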