Exploring text vectors, part 2
Using the return_weights() function you wrote in the previous exercise, you're now going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.
This exercise is part of the course
Preprocessing for Machine Learning in Python
Exercise instructions
- Inside the loop, call return_weights() to return the top weighted words for the current document.
- Call set() on filter_list to remove duplicated indices.
- Call words_to_filter, passing in the following parameters: vocab for the vocab parameter, tfidf_vec.vocabulary_ for the original_vocab parameter, text_tfidf for the vector parameter, and 3 for the top_n parameter, to grab the top 3 weighted words from each document.
- Finally, pass that filtered_words set into a list to use as a filter for the text vector.
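The set() step matters because documents often share their top words, so the combined list of indices contains repeats. A minimal sketch of that step on its own, using made-up column indices (per_document_top is a hypothetical stand-in for what return_weights() would return for each document):

```python
# Hypothetical top-3 column indices for three documents (illustrative values only)
per_document_top = [[4, 7, 1], [7, 2, 4], [9, 1, 7]]

filter_list = []
for top in per_document_top:
    # extend() appends each index individually, so repeats accumulate
    filter_list.extend(top)

# A set keeps each column index only once
unique_indices = set(filter_list)

print(len(filter_list), len(unique_indices))
```

Without the set, a column shared by several documents would be selected more than once when the list is used to index the text vector.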
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        # Call the return_weights function and extend filter_list
        filtered = ____(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return ____(filter_list)

# Call the function to get the list of word indices
filtered_words = ____(____, ____, ____, ____)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(____)]
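For readers who want to see the whole flow run end to end, here is one possible completed sketch on a tiny corpus. Several things in it are assumptions rather than part of the exercise: the docs list is invented, return_weights() is a hypothetical stand-in for the function from the previous exercise (which is not shown here), and vocab is assumed to be the inverse of tfidf_vec.vocabulary_, mapping column index to word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Hypothetical stand-in for the function from the previous exercise
    row = vector[vector_index]  # one document as a 1 x n_terms sparse row
    # Map each word present in the document to its tf-idf weight
    weights = {vocab[int(i)]: w for i, w in zip(row.indices, row.data)}
    # Keep the top_n highest-weighted words
    top_words = sorted(weights, key=weights.get, reverse=True)[:top_n]
    # Translate the words back into column indices
    return [original_vocab[word] for word in top_words]


def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        # Collect the top column indices for document i
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # A set removes indices shared by several documents
    return set(filter_list)


# Invented example corpus (assumption, not from the exercise)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# Assumed shape of vocab: column index -> word
vocab = {idx: word for word, idx in tfidf_vec.vocabulary_.items()}

filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# sorted() is used instead of list() only to make the column order predictable
filtered_text = text_tfidf[:, sorted(filtered_words)]

print(text_tfidf.shape, "->", filtered_text.shape)
```

The filtered matrix keeps every document (row) but only the columns that were among some document's top three words.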