Stopwords and hashing

The next steps are to remove stopwords and then apply the hashing trick, converting the results into a TF-IDF representation.

A quick reminder about these concepts:

  • The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
  • The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document and the frequency of the word across all of the documents in the collection.
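
To make the two ideas above concrete, here is a minimal pure-Python sketch rather than Spark's actual implementation. It assumes CRC32 as a stand-in hash function, a hypothetical bucket count of 16 (the exercise uses numFeatures=1024), three made-up messages, and the smoothed IDF form log((N + 1) / (df + 1)).

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16  # hypothetical; the exercise itself uses numFeatures=1024


def hash_tf(tokens, num_buckets=NUM_BUCKETS):
    """Hashing trick: map each token to a bucket index and count hits per bucket."""
    # CRC32 is an assumed stand-in hash; it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    return Counter(zlib.crc32(tok.encode("utf-8")) % num_buckets for tok in tokens)


def idf_weights(tf_docs):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_docs)
    # Document frequency: in how many documents does each bucket appear?
    df = Counter(bucket for tf in tf_docs for bucket in tf)
    return {bucket: math.log((n_docs + 1) / (d + 1)) for bucket, d in df.items()}


# Made-up tokenized messages, for illustration only
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

tf_docs = [hash_tf(doc) for doc in docs]
idf = idf_weights(tf_docs)

# TF-IDF: term frequency in the document times the bucket's IDF weight
tf_idf = [{b: count * idf[b] for b, count in tf.items()} for tf in tf_docs]

for doc, weights in zip(docs, tf_idf):
    print(doc, weights)
```

Because the number of buckets is fixed, two different words can land in the same bucket (a collision). Using more buckets makes collisions rarer at the cost of longer feature vectors.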

The tokenized SMS data are stored in sms in a column named words. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.
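
As a rough illustration of what stopword removal does to one tokenized message, the sketch below filters a token list against a small set. The stopword set, the sample tokens, and the case-insensitive comparison are assumptions made for this example; real stopword lists are much longer.

```python
# Hypothetical mini stopword list, for illustration only
STOPWORDS = {"a", "the", "is", "to", "and", "you", "i"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stopwords (compared case-insensitively)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


words = ["you", "have", "won", "a", "prize"]  # made-up tokenized message
terms = remove_stopwords(words)
print(terms)  # ['have', 'won', 'prize']
```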

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Import the StopWordsRemover, HashingTF and IDF classes.
  • Create a StopWordsRemover object (input column words, output column terms). Apply to sms.
  • Create a HashingTF object (input results from previous step, output column hash). Apply to wrangled.
  • Create an IDF object (input results from previous step, output column features). Apply to wrangled.

Interactive hands-on exercise

Try to solve this exercise by completing the sample code.

from pyspark.ml.____ import ____, ____, ____

# Remove stopwords
wrangled = ____(inputCol=____, outputCol=____)\
      .____(sms)

# Apply the hashing trick
wrangled = ____(____, ____, numFeatures=1024)\
      .____(wrangled)

# Convert hashed symbols to TF-IDF
tf_idf = ____(____, ____)\
      .____(wrangled).____(wrangled)
      
tf_idf.select('terms', 'features').show(4, truncate=False)