
Transforming unseen data

Any transformation you apply to text before training a machine learning model must also be applied to new, unseen (test) data. To do this, follow the same approach as in the last chapter: fit the vectorizer only on the training data, then use the fitted vectorizer to transform the test data.

For this exercise the speech_df DataFrame has been split in two:

  • train_speech_df: The training set consisting of the first 45 speeches.
  • test_speech_df: The test set consisting of the remaining speeches.
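To make the fit-on-train, transform-on-test idea concrete, here is a minimal sketch on two tiny hypothetical corpora (`train_texts` and `test_texts` are made up for illustration and are not the speech data). It assumes scikit-learn is installed and uses the default `TfidfVectorizer` settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the training and test speeches
train_texts = ["the people of the nation", "the union and the people"]
test_texts = ["the nation and its economy"]

tv = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts only
train_matrix = tv.fit_transform(train_texts)

# Reuse the fitted vocabulary and weights on the test texts (no refitting)
test_matrix = tv.transform(test_texts)

# Both matrices share the same columns, one per training vocabulary term
print(sorted(tv.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# Words seen only in the test text never become features
print("economy" in tv.vocabulary_)
```

Because the vectorizer is fitted once, the test matrix has exactly the columns the model was trained on, and words that appear only in the test text are ignored rather than creating new features.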

This exercise is part of the course

Feature Engineering for Machine Learning in Python


Exercise instructions

  • Instantiate TfidfVectorizer.
  • Fit the vectorizer on the text_clean column of the training data and transform that column.
  • Apply the same vectorizer on the text_clean column of the test data.
  • Create a DataFrame of these new features from the test set.
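The last step, turning the transformed test matrix into a labeled DataFrame, can be sketched on a miniature stand-in for the data. The four-row `speech_df` below and the three-to-one split are hypothetical and chosen only for illustration. The feature-name lookup is hedged because the method name depends on the scikit-learn version: `get_feature_names_out()` exists from 1.0 onward, while older releases used `get_feature_names()`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature stand-in for speech_df
speech_df = pd.DataFrame({"text_clean": [
    "citizens of the senate",
    "citizens of the united states",
    "the senate and the people",
    "people of the united states",
]})

# First three rows as training data, the remaining row as test data
train_df = speech_df.iloc[:3]
test_df = speech_df.iloc[3:]

# Fit on the training text only, then transform the test text
tv = TfidfVectorizer(stop_words="english")
tv.fit(train_df["text_clean"])
test_matrix = tv.transform(test_df["text_clean"])

# Look up the learned feature names, whichever method this version provides
if hasattr(tv, "get_feature_names_out"):
    feature_names = tv.get_feature_names_out()
else:
    feature_names = tv.get_feature_names()

# The transform result is sparse, so convert it to a dense array first
test_features = pd.DataFrame(
    test_matrix.toarray(), columns=feature_names
).add_prefix("TFIDF_")
print(test_features)
```

The prefix makes it easy to tell the TF-IDF columns apart from other features if the result is later joined back onto the original DataFrame.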

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Instantiate TfidfVectorizer
tv = ____(max_features=100, stop_words='english')

# Fit the vectorizer and transform the training data
tv_transformed = ____

# Transform test data
test_tv_transformed = ____

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.____, 
                          columns=tv.____).add_prefix('TFIDF_')
print(test_tv_df.head())