Exercise

Feature hashing and LabeledPoint

The previous step split each email into words, so the 'spam' and 'non-spam' datasets now hold one-line messages, each represented by its words. To classify these messages, we need to convert the text into numeric features.

In the second part of the exercise, you'll first create a HashingTF() instance that maps text to vectors of 200 features. Each message in the 'spam' and 'non-spam' files is split into words, and each word is mapped to one feature. These features are used to decide whether a message is 'spam' or 'non-spam'. Next, you'll label the features: 0 for a valid (non-spam) message and 1 for a spam message. Finally, you'll combine the two labeled datasets into one.
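To make the idea of feature hashing concrete, here is a minimal pure-Python sketch of the steps described above. It is an illustration only: the helper hash_words, the toy messages, and the use of zlib.crc32 as the hash are assumptions made for this example, and Spark's HashingTF may use a different hash function internally.

```python
import zlib

NUM_FEATURES = 200  # same vector size the exercise asks for


def hash_words(words, num_features=NUM_FEATURES):
    """Hypothetical helper: return a sparse term-frequency vector as {index: count}."""
    vector = {}
    for word in words:
        # crc32 is a stand-in hash chosen for this sketch, not Spark's own hash
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


# Toy messages, already split into words (made up for illustration)
spam = [["win", "cash", "now", "win"]]
non_spam = [["see", "you", "at", "lunch"]]

# Label the hashed features: 1 for spam, 0 for non-spam
spam_samples = [(1, hash_words(words)) for words in spam]
non_spam_samples = [(0, hash_words(words)) for words in non_spam]

# Combine both labeled datasets into one
samples = spam_samples + non_spam_samples
```

Because the vector has a fixed size, two different words can land on the same index (a collision). A larger number of features makes collisions less likely at the cost of longer vectors.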

Remember, you have a SparkContext sc available in your workspace. The spam_words and non_spam_words variables are also already available.

Instructions
  • Create a HashingTF() instance to map email text to vectors of 200 features.
  • Each message in the 'spam' and 'non-spam' datasets is split into words, and each word is mapped to one feature.
  • Label the features: 1 for spam, 0 for non-spam.
  • Combine both the spam and non-spam samples into a single dataset.