
Controlling the vocabulary with the Tokenizer

Let's drill down a bit more into the operation of the Tokenizer. In this exercise you will learn how to convert an arbitrary sentence to a sequence using a trained Tokenizer. Furthermore, you will learn to control the size of the vocabulary of the Tokenizer. You will also investigate what happens to the out-of-vocabulary (OOV) words when you limit the vocabulary size of a Tokenizer.

For this exercise, you have been provided with the en_tok Tokenizer that you previously implemented. The Tokenizer has been imported for you.
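As a refresher, the sketch below shows how a Tokenizer like en_tok might have been fitted and used. The import path and the tiny en_text sample are assumptions made for illustration only; they are not the course's actual data.

from tensorflow.keras.preprocessing.text import Tokenizer  # assumed import path

# Hypothetical miniature corpus standing in for the course's en_text
en_text = ['she likes grapefruit', 'he likes peaches and lemons']

en_tok = Tokenizer()
en_tok.fit_on_texts(en_text)       # learn the word -> ID vocabulary
print(en_tok.word_index)           # more frequent words receive smaller IDs
print(en_tok.texts_to_sequences(['she likes lemons']))  # [[2, 1, 7]] for this tiny corpus

Note that, by default, the Tokenizer lowercases text and strips punctuation, so tokens such as ',' and '.' only keep their own IDs if the Tokenizer was created with custom filters.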

This exercise is part of the course

Machine Translation with Keras


Exercise instructions

  • Convert the following sentence to a sequence using the previously trained en_tok Tokenizer: she likes grapefruit , peaches , and lemons .
  • Create a new Tokenizer, en_tok_new, with a vocabulary size of 50 and the out-of-vocabulary (OOV) token UNK.
  • Fit the new Tokenizer on the en_text data.
  • Convert the sentence she likes grapefruit , peaches , and lemons . to a sequence with en_tok_new.

Interactive hands-on exercise

Try to solve this exercise by completing the sample code.

# Convert the sentence to a word ID sequence
seq = ____.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = ____(num_words=____, ____=____)

# Fit the tokenizer on en_text
en_tok_new.____(____)

# Convert the sentence to a word ID sequence
seq_new = en_tok_new.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
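For reference, here is one way the completed code could look. It is a sketch that assumes en_tok, en_text, and the Tokenizer class are already available in the exercise workspace, as stated above; the import line is included only to make explicit where Tokenizer comes from.

from tensorflow.keras.preprocessing.text import Tokenizer  # assumed import; provided for you in the exercise

# Convert the sentence with the previously fitted tokenizer (en_tok is provided)
seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Limit the vocabulary to 50 words and map everything else to 'UNK'
en_tok_new = Tokenizer(num_words=50, oov_token='UNK')

# Fit the new tokenizer on the English corpus (en_text is provided)
en_tok_new.fit_on_texts(en_text)

# Words outside the limited vocabulary are replaced by the OOV token, which Keras assigns ID 1
seq_new = en_tok_new.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])

Because oov_token is set, Keras reserves ID 1 for it, so any word that falls outside the limited vocabulary shows up as 1 in seq_new, and the final print should report UNK.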