Part 1: Exploring the dataset

You will now explore the dataset. First, you will get a feel for what the data looks like: you will print some of it and learn how to tokenize its sentences into individual words. For English, tokenization appears to be a trivial task because words are separated by spaces. However, some languages, such as Japanese, are not as consistently delimited as English.
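To make the contrast concrete, here is a minimal sketch of space-based tokenization. Both sentences are hypothetical examples made up for illustration; they are not taken from the exercise's datasets.

```python
# Hypothetical example sentences (assumptions, not from en_text or fr_text).
english = "the cat sat on the mat"
japanese = "私は東京に住んでいます"  # written without spaces between words

# str.split(" ") cuts the string at every space character.
english_tokens = english.split(" ")
japanese_tokens = japanese.split(" ")

print("English tokens: ", english_tokens)
# With no spaces to split on, the whole sentence comes back as a single item.
print("Japanese tokens:", japanese_tokens)
```

Splitting on spaces is enough for the English sentence, but it leaves the Japanese sentence as one undivided string, which is why such languages usually need a dedicated tokenizer.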

For this exercise, you have been provided with two datasets: en_text and fr_text. en_text contains a list of English sentences, while fr_text contains the corresponding list of French sentences.

This exercise is part of the course

Machine Translation with Keras

Exercise instructions

  • Use the zip() function to iterate through the first 5 English sentences (en_text) and the first 5 French sentences (fr_text) in parallel.
  • Get the first English sentence from en_text.
  • Tokenize the obtained sentence using the split() function with the space character as the separator, and assign the result to first_words.
  • Print the tokenized words.
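
The steps above can be sketched on toy data as follows. The two short lists here are hypothetical stand-ins for en_text and fr_text, invented for illustration; the real datasets are supplied by the exercise environment and will contain different sentences.

```python
# Hypothetical stand-ins for the exercise's en_text and fr_text (assumed data).
en_text = [
    "the weather is nice today",
    "she likes green apples",
    "we are going to paris",
    "he reads a book",
    "the dog is sleeping",
    "i drink coffee in the morning",
]
fr_text = [
    "il fait beau aujourd'hui",
    "elle aime les pommes vertes",
    "nous allons à paris",
    "il lit un livre",
    "le chien dort",
    "je bois du café le matin",
]

# Slicing with [:5] keeps the first 5 items; zip() pairs them up in order.
pairs = list(zip(en_text[:5], fr_text[:5]))
for en_sent, fr_sent in pairs:
    print("English: ", en_sent)
    print("\tFrench: ", fr_sent)

# Index 0 selects the first sentence; split(" ") breaks it at each space.
first_sent = en_text[0]
first_words = first_sent.split(" ")
print("First sentence: ", first_sent)
print("\tWords: ", first_words)
```

Because zip() stops at the shorter of its inputs, slicing both lists to the same length keeps every English sentence aligned with its French counterpart.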

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Iterate through the first 5 English and French sentences in the dataset
for en_sent, fr_sent in zip(____, ____):  
  print("English: ", en_sent)
  print("\tFrench: ", fr_sent)

# Get the first sentence of the English dataset
first_sent = ____[____]
print("First sentence: ", first_sent)
# Tokenize the first sentence
____ = ____.____(____)
# Print the tokenized words
print("\tWords: ", ____)