Loading spam and non-spam data
Logistic Regression is a popular method to predict a categorical response. Probably one of the most common applications of the logistic regression is the message or email spam classification. In this 3-part exercise, you'll create an email spam classifier with logistic regression using Spark MLlib. Here are the brief steps for creating a spam classifier.
- Create an RDD of strings representing email.
- Run MLlib’s feature extraction algorithms to convert text into an RDD of vectors.
- Call a classification algorithm on the RDD of vectors to return a model object to classify new points.
- Evaluate the model on a test dataset using one of MLlib’s evaluation functions.
In the first part of the exercise, you'll load the 'spam' and 'ham' (non-spam) files into RDDs, split the emails into individual words, and look at the first element in each of the RDD.
Remember, you have a SparkContext sc
available in your workspace. Also file_path_spam
variable (which is the path to the 'spam' file) and file_path_non_spam
(which is the path to the 'non-spam' file) is already available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Create two RDDS, one for 'spam' and one for 'non-spam (ham)'.
- Split each email in 'spam' and 'non-spam' RDDs into words.
- Print the first element in the split RDD of both 'spam' and 'non-spam'.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the datasets into RDDs
spam_rdd = sc.____(file_path_spam)
non_spam_rdd = sc.____(file_path_non_spam)
# Split the email messages into words
spam_words = spam_rdd.____(lambda email: email.split(' '))
non_spam_words = non_spam_rdd.____(lambda email: ____.____(' '))
# Print the first element in the split RDD
print("The first element in spam_words is", spam_words.____())
print("The first element in non_spam_words is", ____.____())