Part 2: Exploring the dataset

Now you will explore some attributes of the dataset. Specifically, you will determine the average sentence length (i.e. the number of words per sentence) and the vocabulary size of the English dataset.

For this exercise, the English dataset en_text, a list of English sentences, has been provided. You will use the Python list method <list>.extend(), a variant of <list>.append(). To see the difference, say a=[1,2,3] and b=[4,5]. a.append(b) adds b as a single element, so a becomes [1,2,3,[4,5]], whereas a.extend(b) adds each element of b in turn, so a becomes [1,2,3,4,5].
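The difference between the two methods can be checked directly. The short sketch below works on copies of a so that both results can be compared side by side; the names appended and extended are only illustrative and are not part of the exercise.

```python
a = [1, 2, 3]
b = [4, 5]

# Work on copies so the original list a is left untouched
appended = list(a)
appended.append(b)   # adds the list b itself as one nested element

extended = list(a)
extended.extend(b)   # adds each element of b individually

print(appended)  # [1, 2, 3, [4, 5]]
print(extended)  # [1, 2, 3, 4, 5]
```

Both methods modify the list in place and return None, so their result should not be assigned back to a variable.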

This exercise is part of the course Machine Translation with Keras.


Exercise instructions

Compute the length of each sentence using the split() function and the len() function, while iterating through en_text.
Compute the mean sentence length using numpy.
Populate the list all_words in the body of the for loop by adding all the words found in each sentence after tokenizing it.
Convert the list all_words to a set object and compute the size of the set.
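The steps above can be tried on a small stand-in dataset first. This is a minimal sketch, not the exercise solution for en_text: the list toy_text and its three sentences are made up for illustration, and splitting on whitespace with split() is assumed to be an adequate tokenizer.

```python
import numpy as np

# Hypothetical stand-in for en_text (not the real dataset)
toy_text = [
    "new jersey is sometimes quiet",
    "the united states is usually chilly",
    "new jersey is nice",
]

# Number of words in each sentence, using whitespace tokenization
sent_lengths = [len(sent.split()) for sent in toy_text]
mean_length = np.mean(sent_lengths)
print("Mean sentence length:", mean_length)  # 5.0

# Collect every word from every sentence into one flat list
all_words = []
for sent in toy_text:
    all_words.extend(sent.split())

# A set keeps one copy of each distinct word
vocab_size = len(set(all_words))
print("Vocabulary size:", vocab_size)  # 11
```

Using extend() keeps all_words flat, so converting it to a set works; with append() the list would hold nested lists, which are unhashable and cannot be placed in a set.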

Hands-on interactive exercise

Complete this sample code to finish the exercise.

import numpy as np

# Compute the length of each sentence
sent_lengths = [len(____.____(____)) for en_sent in ____]
# Compute the mean of sentence lengths
mean_length = np.____(____)
print('(English) Mean sentence length: ', mean_length)

all_words = []
for sent in en_text:
  # Populate all_words with all the words in sentences
  all_words.____(____.____(____))
# Compute the length of the set containing all_words
vocab_size = len(____(____))
print("(English) Vocabulary size: ", vocab_size)
