Controlling the vocabulary with the Tokenizer
Let's drill down a bit more into the operation of the Tokenizer. In this exercise you will learn how to convert an arbitrary sentence to a sequence using a trained Tokenizer. Furthermore, you will learn to control the vocabulary size of the Tokenizer, and investigate what happens to out-of-vocabulary (OOV) words when you limit the vocabulary size.
For this exercise, you have been provided with the en_tok Tokenizer that you previously implemented. The Tokenizer has been imported for you.
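As a quick refresher, a fitted Keras Tokenizer converts text to ID sequences with texts_to_sequences. The sketch below fits a tokenizer on a tiny stand-in corpus (an assumption for illustration; the real en_tok was fit on the full en_text, which is not shown here):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Tiny stand-in corpus; the course's en_tok was fit on the full en_text
corpus = ['she likes grapefruit', 'she likes peaches and lemons']

tok = Tokenizer()          # default settings lowercase and strip punctuation
tok.fit_on_texts(corpus)   # builds the word -> ID mapping from the corpus

# texts_to_sequences expects a list of strings and returns a list of ID lists
seq = tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence:', seq)
```

Note that the default filters strip punctuation, so the commas and the full stop produce no IDs in the output sequence.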
This exercise is part of the course Machine Translation with Keras.
Exercise instructions
- Convert the following sentence to a sequence using the previous en_tok Tokenizer: she likes grapefruit , peaches , and lemons .
- Create a new Tokenizer, en_tok_new, with a vocabulary size of 50 and out-of-vocabulary word UNK.
- Fit the new tokenizer on the en_text data.
- Convert the sentence she likes grapefruit , peaches , and lemons . to a sequence with en_tok_new.
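Before filling in the scaffold, it helps to see what limiting the vocabulary actually does. In a Keras Tokenizer, when num_words is set, only the top num_words - 1 IDs are emitted by texts_to_sequences; any rarer or unseen word is replaced by the oov_token's ID. The corpus and num_words value below are illustrative, not from the exercise:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative corpus: 'she' and 'likes' are the most frequent words
corpus = ['she likes grapefruit', 'she likes peaches']

# num_words=4 keeps only IDs 1-3 (1 = 'UNK', 2 = 'she', 3 = 'likes')
limited = Tokenizer(num_words=4, oov_token='UNK')
limited.fit_on_texts(corpus)

# Rare words fall outside the limit and collapse to the UNK ID
seq = limited.texts_to_sequences(['she likes grapefruit peaches'])
print(seq)
print(limited.index_word[1])  # -> 'UNK'
```

Here 'grapefruit' and 'peaches' both map to ID 1, which is exactly the behavior you will observe with OOV words in the exercise.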
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Convert the sentence to a word ID sequence
seq = ____.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)
# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = ____(num_words=____, ____=____)
# Fit the tokenizer on en_text
en_tok_new.____(____)
# Convert the sentence to a word ID sequence
seq_new = en_tok_new.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
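For reference, a completed version of the scaffold might look as follows. Since en_tok and en_text live in the exercise environment, they are recreated here as small stand-ins (an assumption, not the course data):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Stand-ins for the exercise-provided objects (assumption, not course data)
en_text = ['she likes grapefruit', 'she likes peaches and lemons',
           'they eat apples and pears']
en_tok = Tokenizer()
en_tok.fit_on_texts(en_text)

# Convert the sentence to a word ID sequence
seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = Tokenizer(num_words=50, oov_token='UNK')

# Fit the tokenizer on en_text
en_tok_new.fit_on_texts(en_text)

# Convert the sentence to a word ID sequence
seq_new = en_tok_new.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
```

With this tiny stand-in corpus every word of the sentence fits within the 50-word budget, so no UNK ID appears in seq_new; any word missing from en_text, or ranked below the top 49, would map to ID 1 instead.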