Getting used to text data
In this exercise, you will work with text data by analyzing quotes from Sheldon Cooper in the TV show The Big Bang Theory. This will give you a first taste of what it's like to deal with real-world text data.
You will use dictionary comprehensions to create dictionaries that map words to indexes and vice versa. Dictionaries are used instead of, for example, a pandas.DataFrame because they are more intuitive and don't add unnecessary complexity.
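For example, a tiny word-to-index mapping built from two made-up quotes (a hypothetical stand-in for the exercise data, just to show the idea) could look like this:

# Toy illustration: mapping words to indexes with plain dictionaries
quotes = ["I'm not crazy.", "My mother had me tested."]
words = ' '.join(quotes).split(' ')                            # flatten the sentences into words
unique_words = sorted(set(words))                              # deduplicate and sort for stable indexes
index_to_word = {i: wd for i, wd in enumerate(unique_words)}   # index -> word
word_to_index = {wd: i for i, wd in enumerate(unique_words)}   # word -> index
print(word_to_index['tested.'])                                # look up a word's index directly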
The data is available in sheldon_quotes, with the first two sentences already printed for you.
This exercise is part of the course Recurrent Neural Networks (RNNs) for Language Modeling with Keras.
Exercise instructions
- Use join to combine the sentences into one string, then split it to extract all the words and store the resulting list in all_words.
- Remove the duplicated words by applying list(set()) to the list of words and store the result in unique_words.
- Create a dictionary with indexes as keys and words as values using a dictionary comprehension.
- Create a dictionary with words as keys and indexes as values using a dictionary comprehension.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Transform the list of sentences into a list of words
all_words = ' '.____(sheldon_quotes).split(' ')
# Get the list of unique words
unique_words = list(set(all_words))
# Dictionary of indexes as keys and words as values
index_to_word = {____ for i, wd in enumerate(sorted(unique_words))}
print(index_to_word)
# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for ____ in enumerate(sorted(unique_words))}
print(word_to_index)
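For reference, one way the blanks above could be completed is sketched below (assuming sheldon_quotes is the predefined list of quote strings; the two quotes shown are only placeholders so the snippet runs on its own):

# Possible completion of the sample code (placeholder data stands in for the exercise's sheldon_quotes)
sheldon_quotes = ["Bazinga!", "I'm not insane, my mother had me tested!"]

# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes).split(' ')

# Get the list of unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i: wd for i, wd in enumerate(sorted(unique_words))}
print(index_to_word)

# Dictionary of words as keys and indexes as values
word_to_index = {wd: i for i, wd in enumerate(sorted(unique_words))}
print(word_to_index)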