Controlling the vocabulary with the Tokenizer

Let's drill down a bit more into the operation of the Tokenizer. In this exercise you will learn how to convert an arbitrary sentence to a sequence using a trained Tokenizer. Furthermore, you will learn to control the size of the Tokenizer's vocabulary. You will also investigate what happens to out-of-vocabulary (OOV) words when you limit the vocabulary size of a Tokenizer.

For this exercise, you have been provided with the en_tok Tokenizer that you previously implemented. The Tokenizer has been imported for you.
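
To make the OOV behaviour concrete before you start, here is a minimal standalone sketch. The toy corpus, the tok variable, and the tensorflow.keras import path are illustrative assumptions rather than part of the exercise data.

# Minimal sketch of vocabulary limiting with an OOV token (toy data, not the
# exercise's en_text; the import path assumes TensorFlow's bundled Keras)
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = [
    'new jersey is sometimes quiet during autumn',
    'california is usually quiet during march',
]

# Only words whose ID is below num_words are kept when converting text to
# sequences; every other word is mapped to 'UNK', which always receives ID 1
tok = Tokenizer(num_words=10, oov_token='UNK')
tok.fit_on_texts(corpus)

print(tok.index_word[1])   # -> 'UNK'
# 'winter' was never seen during fitting, so it comes back as the OOV ID 1
print(tok.texts_to_sequences(['california is quiet during winter']))

Any word the tokenizer has never seen, or whose ID falls outside num_words, is returned as ID 1 in the resulting sequence.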

This exercise is part of the course

Machine Translation with Keras

Exercise instructions

  • Convert the following sentence to a sequence using the previously fitted en_tok Tokenizer: she likes grapefruit , peaches , and lemons .
  • Create a new Tokenizer, en_tok_new, with a vocabulary size of 50 and the out-of-vocabulary token UNK.
  • Fit the new tokenizer on the en_text data.
  • Convert the sentence she likes grapefruit , peaches , and lemons . to a sequence with en_tok_new.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Convert the sentence to a word ID sequence
seq = ____.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = ____(num_words=____, ____=____)

# Fit the tokenizer on en_text
en_tok_new.____(____)

# Convert the sentence to a word ID sequence
seq_new = en_tok_new.____(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])
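
For reference, here is one completed version of the scaffold above. It is a minimal sketch that assumes the exercise environment provides en_text (the list of English sentences) and the fitted en_tok Tokenizer, and that Tokenizer comes from TensorFlow's bundled Keras.

# One possible completed solution (sketch); en_tok and en_text are assumed to
# be supplied by the exercise environment, and the import path is an assumption
from tensorflow.keras.preprocessing.text import Tokenizer

# Convert the sentence to a word ID sequence with the original tokenizer
seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
en_tok_new = Tokenizer(num_words=50, oov_token='UNK')

# Fit the tokenizer on en_text
en_tok_new.fit_on_texts(en_text)

# Convert the sentence again; word IDs outside the top 50 become the 'UNK' ID
seq_new = en_tok_new.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])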