Loading SMS spam data

You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file sms.csv contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the UCI Machine Learning Repository. There are a total of 5574 SMS, of which 747 have been labelled as spam.

Notes on CSV format:

  • no header record and
  • fields are separated by a semicolon (this is not the default separator).
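
To see why the separator matters, here is a small sketch that uses only Python's standard csv module rather than Spark. The two sample rows are made up for illustration and are not taken from sms.csv; they only follow the layout described above (no header, semicolon-separated, three fields).

```python
import csv
import io

# Hypothetical rows in the described layout: id;text;label, with no header line.
sample = (
    "1;Sorry, I'll call later in meeting;0\n"
    "2;You have won a prize, call now;1\n"
)

# With the default comma delimiter, the commas inside the message text
# split each record in the wrong place.
default_rows = list(csv.reader(io.StringIO(sample)))

# With delimiter=';' each record yields exactly three fields.
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))

for record_id, text, label in rows:
    # Every field arrives as a string, so the numeric columns need explicit conversion.
    print(int(record_id), text, int(label))
```

Because there is no header and every field is read as text, the column names and types have to be supplied separately, which is what an explicit schema does.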

Data dictionary:

  • id — record identifier
  • text — content of SMS message
  • label — spam or ham (integer; 0 = ham and 1 = spam)

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Specify the data schema, giving column names ("id", "text", and "label") and column types.
  • Read data from a delimited file called "sms.csv".
  • Print the schema for the resulting DataFrame.

Interactive hands-on exercise

Try to solve this exercise by completing the sample code.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("____", IntegerType()),
    ____("____", ____()),
    ____("____", ____())
])

# Load data from a delimited file
sms = spark.read.csv(____, sep=____, header=____, ____=____)

# Print schema of DataFrame
sms.____()