Feature hashing and LabeledPoint
After splitting the emails into words, our raw datasets 'spam' and 'non-spam' are currently composed of one-line messages. In order to classify these messages, we need to convert the text into features.
In the second part of the exercise, you'll first create a HashingTF() instance to map text to vectors of 200 features. Then, for each message in the 'spam' and 'non-spam' files, you'll split the message into words and map each word to one feature. These are the features that will be used to decide whether a message is spam or not. Next, you'll create labels for the features: for a valid message the label will be 0 (i.e. the message is not spam), and for a spam message the label will be 1 (i.e. the message is spam). Finally, you'll combine both labeled datasets.
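To see what feature hashing does before you start, here is a minimal sketch (not part of the exercise) that hashes one tokenized message into a fixed-length vector, assuming the standard pyspark.mllib import:

from pyspark.mllib.feature import HashingTF

tf = HashingTF(numFeatures=200)
# A single tokenized message becomes a 200-dimensional sparse vector;
# each word is hashed to one of the 200 feature indices.
vector = tf.transform(["free", "prize", "click", "now"])
print(vector)  # SparseVector(200, {...}) -- the exact indices depend on the hash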
Remember, you have a SparkContext sc available in your workspace. The spam_words and non_spam_words variables are also already available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Create a HashingTF() instance to map email text to vectors of 200 features.
- Each message in the 'spam' and 'non-spam' datasets is split into words, and each word is mapped to one feature.
- Label the features: 1 for spam, 0 for non-spam.
- Combine both the spam and non-spam samples into a single dataset.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a HashingTF instance with 200 features
tf = ____(numFeatures=200)
# Map each word to one feature
spam_features = tf.____(spam_words)
non_spam_features = tf.____(____)
# Label the features: 1 for spam, 0 for non-spam
spam_samples = spam_features.map(lambda features: LabeledPoint(____, features))
non_spam_samples = non_spam_features.map(lambda features: ____(____, features))
# Combine the two datasets
samples = spam_samples.____(non_spam_samples)
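For reference, one way to fill in the blanks is sketched below, assuming the pyspark.mllib imports shown (in the exercise workspace, HashingTF and LabeledPoint are typically pre-imported):

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

# Create a HashingTF instance with 200 features
tf = HashingTF(numFeatures=200)

# Map each word to one feature
spam_features = tf.transform(spam_words)
non_spam_features = tf.transform(non_spam_words)

# Label the features: 1 for spam, 0 for non-spam
spam_samples = spam_features.map(lambda features: LabeledPoint(1, features))
non_spam_samples = non_spam_features.map(lambda features: LabeledPoint(0, features))

# Combine the two datasets into a single RDD of LabeledPoints
samples = spam_samples.union(non_spam_samples)

Note that union() simply concatenates the two RDDs, so the combined dataset is usually shuffled (or randomly split) before training a classifier on it.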