
Exploring text vectors, part 2

Using the return_weights() function you wrote in the previous exercise, you'll now extract the top words from each document in the text vector, collect their word indices into a list, and use that list to filter the text vector down to those top words.

This exercise is part of the course

Preprocessing for Machine Learning in Python


Exercise instructions

  • Inside the loop, call return_weights() to return the indices of the top weighted words for each document.
  • Call set() on filter_list before returning it, to remove duplicate indices.
  • Call words_to_filter, passing in the following parameters: vocab for the vocab parameter, tfidf_vec.vocabulary_ for the original_vocab parameter, text_tfidf for the vector parameter, and 3 for the top_n parameter to grab the top 3 weighted words from each document.
  • Finally, pass the filtered_words set to list() to use it as a column filter for the text vector.
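
The first two steps above can be sketched on toy data. This is a dependency-free illustration, not the course solution: the tf-idf weights are a plain nested list rather than the sparse matrix a vectorizer returns, and top_indices is a hypothetical simplification of return_weights() that works directly on column indices and skips the vocab and original_vocab lookups.

```python
def top_indices(row, top_n):
    """Hypothetical stand-in for return_weights(): column indices of the
    top_n largest nonzero weights in one document row."""
    nonzero = [(weight, idx) for idx, weight in enumerate(row) if weight > 0]
    # Sort by weight, largest first; ties go to the lower column index
    nonzero.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in nonzero[:top_n]]


def words_to_filter_sketch(matrix, top_n):
    """Collect the top_n column indices of every row, without duplicates."""
    filter_list = []
    for row in matrix:
        filter_list.extend(top_indices(row, top_n))
    # Several documents can share a top word, so deduplicate with set()
    return set(filter_list)


# Toy tf-idf weights: 3 documents (rows) x 5 vocabulary terms (columns)
toy_tfidf = [
    [0.0, 0.7, 0.1, 0.0, 0.6],
    [0.5, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.9, 0.0, 0.3, 0.0],
]

toy_filtered_words = words_to_filter_sketch(toy_tfidf, top_n=2)
```

Columns 1 and 3 each rank in the top two for more than one toy document, which is why the set step matters: without it the same column would be selected repeatedly.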

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = ____(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return ____(filter_list)

# Call the function to get the set of word indices
filtered_words = ____(____, ____, ____, ____)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(____)]
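
The final filtering step can be illustrated the same way. The sketch below assumes toy data and nested lists in place of text_tfidf, and the names toy_matrix and toy_filtered_words are made up for illustration. It uses sorted() instead of list() so that the kept columns come out in ascending order; that ordering is a choice of this sketch, not something the exercise requires.

```python
# Toy stand-in for text_tfidf: 2 documents x 4 terms (dense nested lists)
toy_matrix = [
    [0.1, 0.0, 0.4, 0.9],
    [0.0, 0.3, 0.0, 0.5],
]

# Hypothetical result of words_to_filter(): a set of column indices
toy_filtered_words = {3, 0}

# A set has no positions, so convert it to a list before using it to
# pick columns. sorted() returns a list with the indices in ascending order.
columns = sorted(toy_filtered_words)

# Nested-list equivalent of text_tfidf[:, columns]: keep every row,
# but only the selected columns
toy_filtered_text = [[row[j] for j in columns] for row in toy_matrix]
```

Each row keeps its document position, so the filtered result still lines up with any labels that accompany the original text vector.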