
Part 1: Preprocessing the Data

1. Part 1: Preprocessing the Data

Let's see what kind of preprocessing is needed before feeding data to the model.

2. Introduction to data

As you've already seen, the dataset consists of two Python lists, en_text and fr_text. en_text contains a list of English sentences, where each sentence is a single string with words separated by spaces. fr_text contains the corresponding French translations of the English sentences. Here are some sentences from the dataset.
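For illustration, a pair of parallel lists might look like this (these are assumed sample sentence pairs, not necessarily the exact contents of the dataset):

    en_text = ["new jersey is sometimes quiet during autumn .",
               "california is usually hot in june ."]
    fr_text = ["new jersey est parfois calme pendant l' automne .",
               "california est généralement chaud en juin ."]

    # Each English sentence is a single space-separated string,
    # and fr_text[i] is the translation of en_text[i].
    print(en_text[0])
    print(fr_text[0])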

3. Word tokenization

One of the first preprocessing steps is word tokenization. You already had a brief look at word tokenization in the previous chapter. Tokenization is the process of breaking a sentence or a phrase into tokens, for example individual words. Tokenizing into individual words is known as word tokenization. For example, when the sentence "I watched a movie last night" is tokenized, the result is a Python list whose items are the individual words. Note how tokenization gets rid of the punctuation marks as well. Remember that up to this point, you used a word2index dictionary to convert words to IDs. It would be very tedious to build this mapping manually, even for a small dataset like the one you are using. In Keras, you can learn the word-to-ID mapping automatically using the Tokenizer object, which is located in the keras.preprocessing.text submodule. You can define a tokenizer object as follows.
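A minimal sketch of defining the tokenizer, assuming the standalone keras.preprocessing.text import path mentioned above (in recent TensorFlow releases the same class is available under tensorflow.keras.preprocessing.text):

    from keras.preprocessing.text import Tokenizer

    # Define a tokenizer; it learns the word-to-ID mapping once fitted on text
    en_tok = Tokenizer()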

4. Fitting the Tokenizer

To use the Tokenizer, you first need to fit it on some text. This can be done with the fit_on_texts function, passing a list of strings as the input. This lets the tokenizer learn the word-to-ID mapping. You can then access the IDs learned by the tokenizer through the word_index attribute, which is a Python dictionary. You can also get the word corresponding to an ID using the index_word attribute.
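A short sketch of fitting and inspecting the tokenizer, using assumed example sentences:

    # Fit the tokenizer on a list of strings
    en_tok.fit_on_texts(["i watched a movie last night", "i like movies"])

    # word_index maps words to IDs; IDs are assigned by word frequency
    print(en_tok.word_index["movie"])   # e.g. 4 for this toy corpus

    # index_word maps IDs back to words
    print(en_tok.index_word[1])         # "i", the most frequent word here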

5. Transforming sentences to sequences

After fitting the tokenizer on some text, you can use the texts_to_sequences function to convert a given string to a sequence of IDs. Note that the input is a list of strings and the output is a list of lists of IDs.
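Continuing with the tokenizer fitted above, converting text to sequences might look like this:

    # texts_to_sequences takes a list of strings and returns a list of lists of IDs
    seqs = en_tok.texts_to_sequences(["i watched a movie"])
    print(seqs)   # e.g. [[1, 2, 3, 4]] with the mapping learned above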

6. Limiting the size of the vocabulary

But you should not leave the tokenizer to do everything automatically. For example, if you don't set up the tokenizer properly, it will learn many rare words in the dataset that appear too infrequently to help the model. Therefore, it is good to limit the size of the vocabulary. You can do this using the num_words argument of the Tokenizer. In this example, you set the vocabulary size to 50, so the tokenizer will only consider the 50 most common words in your text when converting a string to a sequence of IDs and will ignore all other words. Now, how do you cope with out-of-vocabulary words, or OOV words? These are words that didn't make the cut or didn't appear in the provided data at all. Think of a tokenizer fitted on the sentence "I drank milk". For this tokenizer, "water" is an OOV word and will be ignored if you try to convert a sentence such as "I drank water" to a sequence.
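A sketch of limiting the vocabulary and seeing an unseen word get dropped, using the hypothetical "I drank milk" example:

    # Keep only the most common words by limiting the vocabulary size
    tok = Tokenizer(num_words=50)
    tok.fit_on_texts(["i drank milk"])

    # "water" was never seen by the tokenizer, so it is silently dropped
    print(tok.texts_to_sequences(["i drank water"]))   # [[1, 2]]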

7. Treating Out-of-Vocabulary words

There's another way to treat the OOV words in Keras. When defining the Tokenizer you can pass a string to the oov_token argument. This will make the Tokenizer replace any OOV word with the given token. Therefore unknown words will no longer be ignored but will be replaced with a special token.
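A sketch of the same example with an oov_token, so unseen words map to a special ID instead of being dropped (the token string "UNK" is an arbitrary choice):

    # Replace out-of-vocabulary words with a special token instead of dropping them
    tok = Tokenizer(num_words=50, oov_token="UNK")
    tok.fit_on_texts(["i drank milk"])

    print(tok.word_index["UNK"])                       # the OOV token gets ID 1
    print(tok.texts_to_sequences(["i drank water"]))   # [[2, 3, 1]]; "water" becomes UNK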

8. Let's practice!

Great, now that you know how to use the Keras Tokenizer, let's practice with a few exercises.