
Part 2: Exploring the dataset

Now you will explore some attributes of the dataset. Specifically, you will determine the average sentence length (i.e., the number of words per sentence) and the size of the vocabulary for the English dataset.

For this exercise, the English dataset en_text, containing a list of English sentences, has been provided. You will use the Python list method <list>.extend(), a variant of <list>.append(). The difference is easiest to see with an example: if a = [1, 2, 3] and b = [4, 5], then a.append(b) results in the list [1, 2, 3, [4, 5]], whereas a.extend(b) results in [1, 2, 3, 4, 5].
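To see this concretely, here is a minimal runnable sketch (the variable names a and b simply mirror the example above):

a = [1, 2, 3]
b = [4, 5]

a.append(b)   # b is added as a single (nested) element
print(a)      # [1, 2, 3, [4, 5]]

a = [1, 2, 3]
a.extend(b)   # each element of b is added individually
print(a)      # [1, 2, 3, 4, 5]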

This exercise is part of the course

Machine Translation with Keras

Exercise instructions

  • Compute the length of each sentence using the split() function and the len() function, while iterating through en_text.
  • Compute the mean sentence length using numpy.
  • Populate the list all_words, in the for loop body, with all the words found in the sentences after tokenizing them.
  • Convert the list all_words to a set object and compute the length/size of the set.

Interactive practice exercise

Try to solve this exercise by completing the sample code.

# Compute length of sentences
sent_lengths = [len(____.____(____)) for en_sent in ____]
# Compute the mean of the sentence lengths
mean_length = np.____(____)
print('(English) Mean sentence length: ', mean_length)

all_words = []
for sent in en_text:
  # Populate all_words with all the words in sentences
  all_words.____(____.____(____))
# Compute the length of the set containing all_words
vocab_size = len(____(____))
print("(English) Vocabulary size: ", vocab_size)
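For reference, one possible completion of the scaffold is sketched below. It assumes en_text is a list of plain English sentences whose words are separated by single spaces; the numpy import is included so the sketch runs standalone, although np is typically pre-imported in the exercise environment.

import numpy as np

# Compute the length (word count) of each sentence
sent_lengths = [len(en_sent.split(" ")) for en_sent in en_text]
# Compute the mean of the sentence lengths
mean_length = np.mean(sent_lengths)
print('(English) Mean sentence length: ', mean_length)

all_words = []
for sent in en_text:
  # Populate all_words with all the words in the sentences
  all_words.extend(sent.split(" "))
# Compute the length of the set containing all_words
vocab_size = len(set(all_words))
print("(English) Vocabulary size: ", vocab_size)

Converting all_words to a set removes duplicate words, so its length gives the number of unique words, i.e. the vocabulary size.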