Exercise

Part 2: Exploring the dataset

Now you will explore some attributes of the dataset. Specifically, you will determine the average length (i.e. number of words) of all sentences and the size of the vocabulary for the English dataset.

For this exercise, the English dataset en_text, a list of English sentences, has been provided. You will use the Python list method <list>.extend(), which is a variant of <list>.append(). The difference is easiest to see through an example. Say a=[1,2,3] and b=[4,5]. a.append(b) adds b as a single element and results in [1,2,3,[4,5]], whereas a.extend(b) adds each element of b individually and results in [1,2,3,4,5].
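The append/extend example above can be checked directly. This is a minimal sketch; the names appended and extended are introduced here only so that each method is applied to its own copy of a.

```python
a = [1, 2, 3]
b = [4, 5]

# Work on copies, because both methods modify the list in place
appended = list(a)
appended.append(b)   # adds b itself as one nested element

extended = list(a)
extended.extend(b)   # adds each element of b individually

print(appended)  # [1, 2, 3, [4, 5]]
print(extended)  # [1, 2, 3, 4, 5]
```

Both methods return None and change the list in place, so their results should not be assigned back to a variable.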

Instructions
100 XP
  • Compute the length of each sentence using the split() function and the len() function, while iterating through en_text.
  • Compute the mean length of sentences using numpy.
  • Populate the list all_words, in the for loop body, by adding all the words found in each sentence after tokenizing.
  • Convert the list all_words to a set object and compute the length/size of the set.
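
The steps above could be sketched as follows. This is one possible approach, not the exercise's reference solution: the three sentences in en_text are a hypothetical stand-in for the provided dataset, and the names sent_lengths, mean_length and vocab_size are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical stand-in for the provided en_text list
en_text = [
    "new jersey is sometimes quiet during autumn",
    "the united states is usually chilly during july",
    "california is usually quiet during march",
]

sent_lengths = []  # number of words in each sentence
all_words = []     # every word occurrence, duplicates included

for sent in en_text:
    words = sent.split()             # tokenize on whitespace
    sent_lengths.append(len(words))  # one length per sentence
    all_words.extend(words)          # add the words themselves, not a nested list

mean_length = np.mean(sent_lengths)
vocab_size = len(set(all_words))     # a set keeps only unique words

print("Mean sentence length:", mean_length)  # 7.0
print("Vocabulary size:", vocab_size)        # 15
```

Using extend() here keeps all_words flat, so converting it to a set counts unique words. With append(), the list would hold one sub-list per sentence, and set() would fail because lists are not hashable.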