
Part 2: Exploring the dataset

Now you will explore some attributes of the dataset. Specifically, you will determine the average length (i.e., the number of words) of the sentences and the size of the vocabulary of the English dataset.

For this exercise, the English dataset en_text, containing a list of English sentences, has been provided. You will use the Python list method <list>.extend(), which is a variant of <list>.append(). Let's understand the difference through an example: say a = [1, 2, 3] and b = [4, 5]. Then a.append(b) results in the list [1, 2, 3, [4, 5]], whereas a.extend(b) results in [1, 2, 3, 4, 5].
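You can confirm this behavior with a quick check; the snippet below is a minimal sketch using the same two lists:

a = [1, 2, 3]
b = [4, 5]

a.append(b)
print(a)  # [1, 2, 3, [4, 5]] - b is appended as a single nested element

a = [1, 2, 3]
a.extend(b)
print(a)  # [1, 2, 3, 4, 5] - the elements of b are added individually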

This exercise is part of the course Machine Translation with Keras.

Exercise instructions

  • Compute the length of each sentence using the split() and len() functions while iterating through en_text.
  • Compute the mean sentence length using numpy.
  • In the body of the for loop, populate the list all_words with all the words found in each sentence after tokenizing it.
  • Convert the list all_words to a set object and compute the length (i.e., size) of the set.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Compute length of sentences
sent_lengths = [len(____.____(____)) for en_sent in ____]
# Compute the mean of sentence lengths
mean_length = np.____(____)
print('(English) Mean sentence length: ', mean_length)

all_words = []
for sent in en_text:
  # Populate all_words with all the words in sentences
  all_words.____(____.____(____))
# Compute the length of the set containing all_words
vocab_size = len(____(____))
print("(English) Vocabulary size: ", vocab_size)