Loading spam and non-spam data

Logistic regression is a popular method for predicting a categorical response. One of its most common applications is classifying messages or emails as spam. In this three-part exercise, you'll build an email spam classifier with logistic regression using Spark MLlib. Here, in brief, are the steps for creating a spam classifier.

Create an RDD of strings representing emails.
Run MLlib’s feature extraction algorithms to convert text into an RDD of vectors.
Call a classification algorithm on the RDD of vectors to return a model object to classify new points.
Evaluate the model on a test dataset using one of MLlib’s evaluation functions.
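
To make the second step more concrete, here is a minimal plain-Python sketch of the idea behind hashing-based feature extraction: each word is hashed to an index in a fixed-length vector, and the value at that index counts how often words landing there appear. This is an illustration only, not MLlib's implementation; the function name hashing_tf, the choice of zlib.crc32 as the hash, the vector size of 16, and the sample emails are all assumptions made for this example.

```python
import zlib

def hashing_tf(words, num_features=16):
    """Map a list of words to a fixed-length term-frequency vector.

    Illustrative only: the hash function and vector size are assumptions,
    not what Spark MLlib uses internally.
    """
    vector = [0.0] * num_features
    for word in words:
        # A deterministic hash of the word picks the slot to increment.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector

# Hypothetical sample emails, already available as plain strings.
emails = ["win cash now now", "meeting at noon"]
vectors = [hashing_tf(email.split(" ")) for email in emails]

for email, vector in zip(emails, vectors):
    print(email, "->", vector)
```

Every email ends up as a vector of the same length regardless of how many words it contains, which is what lets a classifier such as logistic regression consume the data. Different words can collide in the same slot; a larger vector makes that less likely.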

In the first part of the exercise, you'll load the 'spam' and 'ham' (non-spam) files into RDDs, split the emails into individual words, and look at the first element of each RDD.

Remember, a SparkContext sc is already available in your workspace, along with the variables file_path_spam (the path to the 'spam' file) and file_path_non_spam (the path to the 'non-spam' file).
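
When splitting emails into words, it helps to keep in mind how the two common RDD transformations differ: map produces exactly one output element per input element, while flatMap flattens the results into a single sequence of elements. The sketch below imitates both behaviours in plain Python, without Spark, so you can see how the "first element" differs in each case. The sample emails are hypothetical stand-ins for lines of the data files.

```python
from itertools import chain

# Hypothetical sample lines standing in for the contents of an email file.
emails = ["You have won a prize", "Lunch at noon tomorrow"]

# map-style split: one output element per email, each a list of words.
mapped = [email.split(" ") for email in emails]

# flatMap-style split: the per-email word lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(email.split(" ") for email in emails))

print("First element after a map-style split:", mapped[0])
print("First element after a flatMap-style split:", flat_mapped[0])
```

With a map-style split the first element is the whole word list of the first email, whereas with a flatMap-style split it is just the first word.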

This exercise is part of the course

Big Data Fundamentals with PySpark

Exercise instructions

Create two RDDs, one for 'spam' and one for 'non-spam' (ham).
Split each email in the 'spam' and 'non-spam' RDDs into words.
Print the first element of the split RDD for both 'spam' and 'non-spam'.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the datasets into RDDs
spam_rdd = sc.____(file_path_spam)
non_spam_rdd = sc.____(file_path_non_spam)

# Split the email messages into words
spam_words = spam_rdd.____(lambda email: email.split(' '))
non_spam_words = non_spam_rdd.____(lambda email: ____.____(' '))

# Print the first element in the split RDD
print("The first element in spam_words is", spam_words.____())
print("The first element in non_spam_words is", ____.____())
