
1. Selecting features using text vectors

Previously, we used scikit-learn to create a tf-idf vector of text from a dataset, but we don't necessarily need the entire vector to train a model. We could potentially select something like the top 20% of weighted words across the vector. This is a scenario where iteration is important, and it may be helpful to test out different subsets of the tf-idf vector to see what works. Rather than blindly taking some top percentage of a tf-idf vector, let's look at how to pull out the words and their weights on a per-document basis. This isn't especially straightforward to do in scikit-learn, but it's very useful. Let's walk through the different parts we'll need.
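As a starting point, here is a minimal sketch of building that tf-idf vector with scikit-learn's TfidfVectorizer. The documents list is a made-up stand-in for the hiking dataset's location descriptions, and the variable names tfidf_vec and text_tfidf are illustrative choices.

```python
# Minimal sketch: building a tf-idf vector with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the hiking dataset's location descriptions
documents = [
    "Wooded trail along the river with scenic overlooks",
    "Paved loop through the park, good for biking",
    "Steep rocky climb to the summit with river views",
    "Flat gravel path through wetlands and woods",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# One row per document, one column per vocabulary word
print(text_tfidf.shape)
```

The result is a sparse matrix, which is what we'll be subsetting in the rest of this lesson.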

2. Looking at word weights

After we've vectorized our text, the vocabulary and weights are stored in the vectorizer. To pull out the vocabulary list, which we'll need to look at word weights, we can use the vocabulary_ attribute. Here we have a vector of location descriptions from the hiking dataset, and here are the first few words in the vocabulary. Let's also take a look at the row data from the vector itself. Row data contains two components we'll need: the word weights and the indices of the words. To take a look at the weights of the fourth row, for example, we use the data attribute on a specific row, accessed with square bracket subsetting. To get the indices of the words that have been weighted, we use the indices attribute.
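In code, those pieces look roughly like this. The documents here are a made-up stand-in for the hiking dataset's location descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the hiking dataset's location descriptions
documents = [
    "Wooded trail along the river with scenic overlooks",
    "Paved loop through the park, good for biking",
    "Steep rocky climb to the summit with river views",
    "Flat gravel path through wetlands and woods",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# vocabulary_ maps each word to its column index in the vector
print(list(tfidf_vec.vocabulary_.items())[:5])

# Row data for the fourth row (index 3): the tf-idf weights and
# the column indices of the words that were weighted
print(text_tfidf[3].data)
print(text_tfidf[3].indices)
```

The data and indices arrays are parallel: the weight at each position in data belongs to the word whose column number sits at the same position in indices.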

3. Looking at word weights

Before putting together the vocabulary, the word indices, and their weights, we want to reverse the key-value pairs in the vocabulary. It'll be easier later on if we have the index number in the key position in the dictionary. To reverse the vocabulary dictionary, we can swap the key-value pairs by grabbing the items from the vocabulary dictionary and reversing their order. If we take a look, we can see that this worked. Finally, we can also zip together the row indices and weights, pass the result into the dict function, and turn that into a dictionary.
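Sketched in code, with made-up documents standing in for the hiking data, reversing the vocabulary and zipping a row's indices with its weights might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in documents for the hiking dataset
documents = [
    "Wooded trail along the river with scenic overlooks",
    "Paved loop through the park, good for biking",
    "Steep rocky climb to the summit with river views",
    "Flat gravel path through wetlands and woods",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# Swap the key-value pairs so the column index is the key
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}
print(list(vocab.items())[:5])

# Zip the fourth row's word indices and weights into a dictionary
row_dict = dict(zip(text_tfidf[3].indices, text_tfidf[3].data))
print(row_dict)
```

With the reversed vocabulary keyed by index, looking up a word from a row's indices array becomes a simple dictionary access.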

4. Looking at word weights

Let's pull this together into a function. We'll pass in the reversed vocabulary dictionary, the vector, and the row we want to retrieve data for. Inside the function, we'll zip the row's indices and weights into a dictionary, and finally return a dictionary mapping each word to its score. So if we pass in the reversed vocabulary (vocab), the text_tfidf vector, and the index of the 4th row (3), we now have a mapping of words to their scores. At this point we could sort by score, or eliminate the words below a certain threshold.
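Putting the pieces together, a function along these lines does the job. The function name return_weights is an illustrative choice, and the documents are again a made-up stand-in for the hiking data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in documents for the hiking dataset
documents = [
    "Wooded trail along the river with scenic overlooks",
    "Paved loop through the park, good for biking",
    "Steep rocky climb to the summit with river views",
    "Flat gravel path through wetlands and woods",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# Reversed vocabulary: column index -> word
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}

def return_weights(vocab, vector, vector_index):
    """Map each word in one row of a tf-idf vector to its weight."""
    zipped = dict(zip(vector[vector_index].indices,
                      vector[vector_index].data))
    return {vocab[i]: zipped[i] for i in vector[vector_index].indices}

# Word-to-score mapping for the 4th row (index 3)
print(return_weights(vocab, text_tfidf, 3))
```

From here, sorting the returned dictionary by value, or dropping entries below a threshold, gives us the reduced feature set.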

5. Let's practice!

In the exercise, you'll work on using this knowledge to reduce the text feature set.