Exploring text vectors, part 1
Let's expand on the text vector exploration method we just learned about, using the volunteer dataset's title tf/idf vectors. In this first part of text vector exploration, we're going to add to the function we learned about in the slides. The function will return a list of numbers: the vocabulary indices of the top-weighted words in a given row. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.
This exercise is part of the course Preprocessing for Machine Learning in Python.
Exercise instructions
- Add parameters called original_vocab, for the tfidf_vec.vocabulary_, and top_n.
- Call pd.Series() on the zipped dictionary. This will make it easier to operate on.
- Use the .sort_values() function to sort the series and slice the index up to top_n words.
- Call the function, setting original_vocab=tfidf_vec.vocabulary_, setting vector_index=8 to grab the 9th row, and setting top_n=3 to grab the top 3 weighted words.
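The sort-and-slice step in the instructions can be tried on its own before touching the tf/idf vectors. The sketch below uses a small, made-up dictionary of word weights (the words and numbers are purely illustrative, not taken from the volunteer dataset) to show what wrapping a dict in pd.Series() and calling .sort_values() buys us.

```python
import pandas as pd

# Hypothetical word -> tf/idf weight mapping for a single document
weights = {"volunteer": 0.41, "garden": 0.73, "help": 0.18, "community": 0.52}

# Wrapping the dict in a Series gives access to sorting and slicing
weights_series = pd.Series(weights)

# Sort from highest to lowest weight, then keep the index labels
# (the words) of the first top_n entries
top_n = 3
top_words = weights_series.sort_values(ascending=False)[:top_n].index

print(list(top_words))  # ['garden', 'community', 'volunteer']
```

Because the Series index holds the words and the values hold the weights, sorting by value and reading back the index gives the top words directly.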
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Add in the rest of the arguments
def return_weights(vocab, ____, vector, vector_index, ____):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Transform that zipped dict into a series
    zipped_series = ____({vocab[i]:zipped[i] for i in vector[vector_index].indices})

    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.____(ascending=False)[:____].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, ____, text_tfidf, ____, ____))
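For reference, here is one possible completion, written as a self-contained sketch rather than the official solution. It makes several assumptions that are not part of the exercise: the three titles are a toy stand-in for the volunteer dataset, tfidf_vec is assumed to be a scikit-learn TfidfVectorizer, and vocab is assumed to be the reverse of tfidf_vec.vocabulary_ (column index to word). Since the toy corpus only has three rows, it uses vector_index=2 instead of 8.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the volunteer dataset's title column (hypothetical data)
titles = [
    "volunteer at the community garden",
    "help tutor students in reading",
    "community garden cleanup day",
]

# Assumption: tfidf_vec is a TfidfVectorizer fitted on the titles
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(titles)

# Assumption: vocab maps column index -> word (the reverse of vocabulary_)
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}


def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Pair each non-zero column index in the chosen row with its weight
    row = vector[vector_index]
    zipped = dict(zip(row.indices, row.data))

    # Transform that zipped dict into a series indexed by word
    zipped_series = pd.Series({vocab[i]: zipped[i] for i in row.indices})

    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index

    # Map each top word back to its column index in the original vocabulary
    return [original_vocab[word] for word in zipped_index]


# Print out the vocabulary indices of the top weighted words, then the words
top_indices = return_weights(
    vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=2, top_n=3
)
print(top_indices)
print([vocab[i] for i in top_indices])
```

The function returns column indices rather than words, which is what makes the result usable for filtering the columns of text_tfidf later on.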