Loading SMS spam data
You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.
The file sms.csv contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the UCI Machine Learning Repository. There are a total of 5574 SMS, of which 747 have been labelled as spam.
Notes on CSV format:
- no header record and
- fields are separated by a semicolon (this is not the default separator).
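To make these two format notes concrete, here is a minimal sketch that parses a couple of semicolon-separated lines with Python's standard csv module. The sample rows are hypothetical stand-ins, not taken from the real sms.csv.

```python
import csv
import io

# Hypothetical sample rows in the layout described above:
# id;text;label, with no header line.
sample = "1;Sorry, I will call later;0\n2;WINNER! Claim your prize now;1\n"

# Because there is no header record, every line is data.
# The delimiter is set to a semicolon, so commas inside the text are harmless.
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))

for row in rows:
    print(row)
```

Each parsed row has three string fields. Turning the first and last into integers is what an explicit schema does when the file is loaded with Spark.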
Data dictionary:
- id: record identifier
- text: content of SMS message
- label: spam or ham (integer; 0 = ham and 1 = spam)
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Specify the data schema, giving columns names ("id", "text", and "label") and column types.
- Read data from a delimited file called "sms.csv".
- Print the schema for the resulting DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Specify column names and types
schema = StructType([
    StructField("____", IntegerType()),
    ____("____", ____()),
    ____("____", ____())
])
# Load data from a delimited file
sms = spark.read.csv(____, sep=____, header=____, ____=____)
# Print schema of DataFrame
sms.____()