Stop words and hashing
The next steps are to remove stop words, apply the hashing trick, and then convert the hashed counts into TF-IDF features.
A quick reminder about these concepts:
- The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
- The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document and the frequency of the word across all of the documents in the collection.
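To make the two ideas above concrete, here is a minimal pure-Python sketch. It is an illustration only, not how Spark implements them: CRC32 stands in for Spark's hash function, the 16-bucket size and the toy messages are made up, and the smoothed formula log((N + 1) / (df + 1)) is an assumption chosen to mirror the IDF definition in Spark's MLlib documentation.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # toy size for illustration; the exercise uses 1024


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index using a deterministic hash (CRC32 here)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_tf(terms, num_features=NUM_FEATURES):
    """Count terms per bucket; terms that collide share one bucket."""
    return Counter(bucket(t, num_features) for t in terms)


def idf_weights(tf_docs):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_docs)
    # Iterating a Counter yields its keys, so each bucket counts once per document.
    doc_freq = Counter(b for tf in tf_docs for b in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tf_idf(tf_docs):
    """Scale each document's bucket counts by the bucket's IDF weight."""
    idf = idf_weights(tf_docs)
    return [{b: count * idf[b] for b, count in tf.items()} for tf in tf_docs]


# Hypothetical tokenized messages with stop words already removed.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "later"],
    ["free", "free", "entry"],
]
tf_docs = [hashed_tf(d) for d in docs]
weights = tf_idf(tf_docs)
```

Because every term is reduced to one of a fixed number of buckets, the feature vector has the same length no matter how many distinct words appear. The price is that unrelated words can collide in the same bucket. A bucket that occurs in every document gets an IDF weight of zero, which is how words common to the whole collection are discounted.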
The tokenized SMS data are stored in `sms` in a column named `words`. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Import the `StopWordsRemover`, `HashingTF` and `IDF` classes.
- Create a `StopWordsRemover` object (input column `words`, output column `terms`). Apply to `sms`.
- Create a `HashingTF` object (input results from previous step, output column `hash`). Apply to `wrangled`.
- Create an `IDF` object (input results from previous step, output column `features`). Apply to `wrangled`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.____ import ____, ____, ____
# Remove stop words.
wrangled = ____(inputCol=____, outputCol=____)\
.____(sms)
# Apply the hashing trick
wrangled = ____(____, ____, numFeatures=1024)\
.____(wrangled)
# Convert hashed symbols to TF-IDF
tf_idf = ____(____, ____)\
.____(wrangled).____(wrangled)
tf_idf.select('terms', 'features').show(4, truncate=False)