Transforming unseen data
When creating vectors from text, any transformation that you perform on the data before training a machine learning model must also be applied to new, unseen (test) data. To achieve this, follow the same approach as in the last chapter: fit the vectorizer only on the training data, then apply it to the test data.
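As a minimal sketch of why the fit happens only on the training data, the snippet below uses two tiny made-up corpora (hypothetical, not the speech data from this exercise). It assumes scikit-learn is installed and uses the default TfidfVectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, not the speech data used in the exercise
train_texts = ["the people of this nation", "the nation and its government"]
test_texts = ["the government and the economy"]

tv = TfidfVectorizer()
train_matrix = tv.fit_transform(train_texts)  # learns the vocabulary and IDF weights
test_matrix = tv.transform(test_texts)        # reuses them; nothing new is learned

# Both matrices have one column per term in the training vocabulary
print(train_matrix.shape[1] == test_matrix.shape[1])

# "economy" never appeared in the training texts, so it has no column
print("economy" in tv.vocabulary_)
```

Because the test matrix has exactly the columns learned during fitting, a model trained on the training features can be applied to it directly. Words that appear only in the test data are ignored.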
For this exercise the speech_df DataFrame has been split in two:
- train_speech_df: The training set consisting of the first 45 speeches.
- test_speech_df: The test set consisting of the remaining speeches.
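The exercise does not show how the split was made. One plausible way to reproduce it, assuming the rows of speech_df are already in the desired order, is positional slicing with iloc. The DataFrame below is a hypothetical stand-in with 50 placeholder rows.

```python
import pandas as pd

# Hypothetical stand-in for speech_df; the real data is loaded by the exercise
speech_df = pd.DataFrame(
    {"text_clean": [f"speech number {i}" for i in range(50)]}
)

# First 45 rows for training, the remainder for testing
train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]

print(len(train_speech_df), len(test_speech_df))
```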
This exercise is part of the course
Feature Engineering for Machine Learning in Python
Exercise instructions
- Instantiate TfidfVectorizer.
- Fit the vectorizer and apply it to the text_clean column.
- Apply the same vectorizer on the text_clean column of the test data.
- Create a DataFrame of these new features from the test set.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Instantiate TfidfVectorizer
tv = ____(max_features=100, stop_words='english')
# Fit the vectorizer and transform the data
tv_transformed = ____
# Transform test data
test_tv_transformed = ____
# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.____,
                          columns=tv.____).add_prefix('TFIDF_')
print(test_tv_df.head())