Transforming unseen data

Any transformation that you apply to text before training a machine learning model must also be applied to new, unseen (test) data. To achieve this, follow the same approach as in the last chapter: fit the vectorizer on the training data only, then use that fitted vectorizer to transform the test data.
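The fit-on-train, transform-on-test pattern can be sketched on a toy corpus. The texts and variable names below are hypothetical and are not the speech data used in this exercise; the sketch only assumes scikit-learn's TfidfVectorizer with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, not the exercise's speech data)
train_texts = ["the people want peace", "peace and freedom for the people"]
test_texts = ["freedom requires courage"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts only
train_matrix = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; words never seen in training are ignored
test_matrix = vectorizer.transform(test_texts)

# Both matrices have one column per training-vocabulary term
print(train_matrix.shape, test_matrix.shape)
print("courage" in vectorizer.vocabulary_)
```

Because the vocabulary is fixed at fit time, the test matrix has exactly the same columns as the training matrix, which is what a model trained on those columns expects.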

For this exercise the speech_df DataFrame has been split in two:

  • train_speech_df: The training set consisting of the first 45 speeches.
  • test_speech_df: The test set consisting of the remaining speeches.
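A positional split like the one described above might look as follows. This is only a sketch: the DataFrame here is a hypothetical stand-in with 50 placeholder rows, and the exercise does not show how its own split was actually made.

```python
import pandas as pd

# Hypothetical stand-in for speech_df: 50 rows of placeholder text
speech_df = pd.DataFrame(
    {"text_clean": [f"placeholder speech {i}" for i in range(50)]}
)

# First 45 rows form the training set, the rest form the test set
train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]

print(len(train_speech_df), len(test_speech_df))
```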

This exercise is part of the course

Feature Engineering for Machine Learning in Python

Exercise instructions

  • Instantiate TfidfVectorizer.
  • Fit the vectorizer and apply it to the text_clean column of the training data.
  • Apply the same vectorizer on the text_clean column of the test data.
  • Create a DataFrame of these new features from the test set.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Instantiate TfidfVectorizer
tv = ____(max_features=100, stop_words='english')

# Fit the vectorizer and transform the training data
tv_transformed = ____

# Transform test data
test_tv_transformed = ____

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.____, 
                          columns=tv.____).add_prefix('TFIDF_')
print(test_tv_df.head())