Controlling the vocabulary with the Tokenizer
Let's drill down a bit more into the operation of the Tokenizer. In this exercise you will learn how to convert an arbitrary sentence to a sequence using a trained Tokenizer. Furthermore, you will learn to control the vocabulary size of the Tokenizer, and investigate what happens to out-of-vocabulary (OOV) words when you limit the vocabulary size.
For this exercise, you have been provided with the en_tok Tokenizer that you previously implemented. The Tokenizer has been imported for you.
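As a quick refresher, a fitted Keras Tokenizer converts text to ID sequences with texts_to_sequences. The sketch below fits a tokenizer on a tiny stand-in corpus (an assumption for illustration; the real en_tok was fit on the full en_text, which is not shown here):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Tiny stand-in corpus; the course's en_tok was fit on the full en_text
corpus = ['she likes grapefruit', 'she likes peaches and lemons']

tok = Tokenizer()          # default settings lowercase and strip punctuation
tok.fit_on_texts(corpus)   # builds the word -> ID mapping from the corpus

# texts_to_sequences expects a list of strings and returns a list of ID lists
seq = tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence:', seq)
```

Note that the default filters strip punctuation, so the commas and the full stop produce no IDs in the output sequence.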
This exercise is part of the course Machine Translation with Keras.
Exercise instructions
- Convert the following sentence to a sequence using the previous en_tok Tokenizer: she likes grapefruit , peaches , and lemons .
- Create a new Tokenizer, en_tok_new, with a vocabulary size of 50 and out-of-vocabulary word UNK.
- Fit the new tokenizer on the en_text data.
- Convert the sentence she likes grapefruit , peaches , and lemons . to a sequence with en_tok_new.
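Before filling in the scaffold, it helps to see what limiting the vocabulary actually does. In a Keras Tokenizer, when num_words is set, only the top num_words - 1 IDs are emitted by texts_to_sequences; any rarer or unseen word is replaced by the oov_token's ID. The corpus and num_words value below are illustrative, not from the exercise:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative corpus: 'she' and 'likes' are the most frequent words
corpus = ['she likes grapefruit', 'she likes peaches']

# num_words=4 keeps only IDs 1-3 (1 = 'UNK', 2 = 'she', 3 = 'likes')
limited = Tokenizer(num_words=4, oov_token='UNK')
limited.fit_on_texts(corpus)

# Rare words fall outside the limit and collapse to the UNK ID
seq = limited.texts_to_sequences(['she likes grapefruit peaches'])
print(seq)
print(limited.index_word[1])  # -> 'UNK'
```

Here 'grapefruit' and 'peaches' both map to ID 1, which is exactly the behavior you will observe with OOV words in the exercise.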
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Convert the sentence to a word ID sequence
seq = ____.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)
# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = ____(num_words=____, ____=____)
# Fit the tokenizer on en_text
en_tok_new.____(____)
# Convert the sentence to a word ID sequence
seq_new = en_tok_new.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
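For reference, a completed version of the scaffold might look as follows. Since en_tok and en_text live in the exercise environment, they are recreated here as small stand-ins (an assumption, not the course data):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Stand-ins for the exercise-provided objects (assumption, not course data)
en_text = ['she likes grapefruit', 'she likes peaches and lemons',
           'they eat apples and pears']
en_tok = Tokenizer()
en_tok.fit_on_texts(en_text)

# Convert the sentence to a word ID sequence
seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = Tokenizer(num_words=50, oov_token='UNK')

# Fit the tokenizer on en_text
en_tok_new.fit_on_texts(en_text)

# Convert the sentence to a word ID sequence
seq_new = en_tok_new.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
```

With this tiny stand-in corpus every word of the sentence fits within the 50-word budget, so no UNK ID appears in seq_new; any word missing from en_text, or ranked below the top 49, would map to ID 1 instead.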