Word counts

1. Word Count Representation

Once high-level information has been recorded, you can begin creating features based on the actual content of each text.

2. Text to columns

The most common approach is to create a column for each word and record the number of times each particular word appears in each text. This results in a set of columns equal in number to the unique words in the dataset, with the counts filling each entry. Taking just one sentence, we can see that "of" occurs 3 times, "the" 2 times, and the other words once.
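As a rough sketch of the idea, here is how the counts for a single sentence could be tallied with Python's collections.Counter; the sentence below is made up purely for illustration and is not the one used in the course.

```python
from collections import Counter

# A made-up example sentence
sentence = "members of the house of lords and of the commons"

# Lowercase, split on whitespace, and tally each word
word_counts = Counter(sentence.lower().split())
print(word_counts)
# Counter({'of': 3, 'the': 2, 'members': 1, 'house': 1, 'lords': 1, 'and': 1, 'commons': 1})
```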

3. Initializing the vectorizer

While you could of course write a script to do this counting yourself, scikit-learn already has this functionality built in with its CountVectorizer class. As usual, first import CountVectorizer from sklearn.feature_extraction.text, then instantiate it and assign it to a variable, cv in this case.
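A minimal version of those two steps might look like this:

```python
# Import the vectorizer class and create an instance of it
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
```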

4. Specifying the vectorizer

It may have become apparent that creating a column for every word will result in far too many columns to analyze effectively. Thankfully, you can specify arguments when initializing your CountVectorizer to limit this. For example, you can specify the minimum number of texts that a word must be contained in using the argument min_df. If a float is given, the word must appear in at least this percentage of documents. This threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts. Conversely, max_df keeps only words that occur in less than a given percentage of the texts. This can be useful for removing words that occur too frequently to be of any value.
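For instance, the thresholds below (10% and 90%) are purely illustrative values, not recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=0.1: keep only words that appear in at least 10% of the texts
# max_df=0.9: drop words that appear in more than 90% of the texts
cv = CountVectorizer(min_df=0.1, max_df=0.9)
```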

5. Fit the vectorizer

Once the vectorizer has been instantiated you can then fit it on the data you want to create your features around. This is done by calling the fit() method on the relevant column.
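Assuming your texts live in a column called 'text' of a DataFrame named df (both names are placeholders for whatever your data is called), fitting looks like this:

```python
# Learn the vocabulary from the text column
cv.fit(df['text'])
```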

6. Transforming your text

Once the vectorizer has been fit you can call the transform() method on the column you want to transform. This outputs a sparse array, with a row for every text and a column for every word that has been counted.
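Continuing with the same placeholder names:

```python
# Produce a sparse document-term matrix:
# one row per text, one column per counted word
cv_transformed = cv.transform(df['text'])
print(cv_transformed)
```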

7. Transforming your text

To convert this to a dense (non-sparse) array you can use the toarray() method.
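For example:

```python
# Convert the sparse matrix into a regular (dense) NumPy array
cv_array = cv_transformed.toarray()
print(cv_array.shape)  # (number of texts, number of counted words)
```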

8. Getting the features

You may notice that the output is an array, with no concept of column names. To get the names of the features that have been generated you can call the get_feature_names() method on the vectorizer which returns a list of the features generated, in the same order that the columns of the converted array are in.
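A short sketch of that call (note that newer scikit-learn releases rename this method to get_feature_names_out()):

```python
# Feature names, in the same order as the columns of the array
feature_names = cv.get_feature_names()
print(feature_names[:5])
```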

9. Fitting and transforming

As an aside, while fitting and transforming separately can be useful, particularly when you need to transform a different dataset than the one that you fit the vectorizer to, you can accomplish both steps at once using the fit_transform() method.
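Using the same placeholder column as before:

```python
# Fit and transform in a single step
cv_array = cv.fit_transform(df['text']).toarray()
```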

10. Putting it all together

Now that you have an array containing the count values of each of the words of interest, and a way to get the feature names, you can combine these in a DataFrame as shown here. The add_prefix() method makes it easy to distinguish these columns later.
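A sketch of that combination, with 'Counts_' used here only as an example prefix:

```python
import pandas as pd

# Wrap the counts in a DataFrame and prefix the column names
cv_df = pd.DataFrame(cv_array,
                     columns=cv.get_feature_names()).add_prefix('Counts_')
print(cv_df.head())
```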

11. Updating your DataFrame

You can now combine this DataFrame with your original DataFrame using pandas' concat() method, so the new features can be used to build future analytical models. Checking the DataFrame's shape shows its new, much wider size. Remember to set the axis argument to 1, as you want to column-bind these DataFrames.
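Again with the placeholder names from earlier:

```python
# Column-bind the new count features onto the original DataFrame
df = pd.concat([df, cv_df], axis=1)
print(df.shape)  # now much wider than before
```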

12. Let's practice!
