Limiting your features

As you have seen, using the CountVectorizer with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.

To address this, CountVectorizer has parameters that you can set to reduce the number of features:

  • min_df : Use only words that occur in at least this proportion of documents (a float between 0 and 1; an integer is treated as an absolute number of documents). This can be used to remove rare outlier words that will not generalize across texts.
  • max_df : Use only words that occur in at most this proportion of documents. This is useful for eliminating very common words that appear in nearly every document without adding value, such as "and" or "the".
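
To make the effect of these two parameters concrete, here is a minimal sketch on a toy corpus. The documents and the thresholds (0.4 and 0.8) are made up for illustration and are not part of the course data; with five documents, they translate to "at least 2 documents" and "at most 4 documents".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the course data)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
    "the fish swam in the pond",
    "the cat chased the dog",
]

# Default settings: one feature per distinct word
default_cv = CountVectorizer()
default_cv.fit(docs)
default_vocab = sorted(default_cv.vocabulary_)

# Keep only words found in at least 40% and at most 80% of documents
limited_cv = CountVectorizer(min_df=0.4, max_df=0.8)
limited_cv.fit(docs)
limited_vocab = sorted(limited_cv.vocabulary_)

print(len(default_vocab), "features by default")
print(limited_vocab)  # ['cat', 'dog', 'on', 'sat']
```

Here "the" is dropped because it appears in every document, and words such as "mat" or "pond" are dropped because they appear in only one.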

This exercise is part of the course

Feature Engineering for Machine Learning in Python

Exercise instructions

  • Limit the number of features in the CountVectorizer by setting the minimum share of documents a word must appear in to 20% and the maximum to 80%.
  • Fit and apply the vectorizer on the text_clean column in one step.
  • Convert this transformed (sparse) array into a NumPy array with counts.
  • Print the dimensions of the new reduced array.
  • Print the dimensions of the new reduced array.
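
The same sequence of steps can be sketched on a small stand-in dataset. The list `texts` below is a hypothetical substitute for the text_clean column, so the resulting shape will differ from what you get on the real data; this is one possible way to chain the calls, not the course's reference solution.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the text_clean column (10 short documents)
texts = [
    "the people want peace",
    "the people of the nation",
    "the nation and the union",
    "the union of the states",
    "the states want liberty",
    "peace and liberty for all",
    "the duty of the office",
    "the oath of the office",
    "the law and the courts",
    "the people trust the law",
]

# Keep words that appear in at least 20% and at most 80% of documents
cv = CountVectorizer(min_df=0.2, max_df=0.8)

# fit_transform learns the vocabulary and returns a sparse count matrix
cv_transformed = cv.fit_transform(texts)

# toarray converts the sparse matrix into a dense NumPy array
cv_array = cv_transformed.toarray()

# Rows are documents, columns are the retained words
print(cv_array.shape)
```

A pandas column of strings can be passed to fit_transform in the same way as the list used here.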

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = ____

# Fit, transform, and convert into array
cv_transformed = ____(speech_df['text_clean'])
cv_array = ____

# Print the array shape
print(____)