Loading flights data

In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, the data have been trimmed down to only 50 000 records. A larger dataset in the same format is also available.

Notes on CSV format:

Fields are separated by a comma (this is the default separator).
Missing data are denoted by the string 'NA'.
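
The two format notes above can be illustrated without Spark. The following is a minimal sketch using only Python's standard library; the three-column sample text and the helper name `parse_rows` are made up for illustration, and it assumes the first row holds the column names.

```python
import csv
import io

# Hypothetical sample in the format described above; the values are invented.
SAMPLE = "mon,carrier,delay\n11,AA,27\n2,UA,NA\n"


def parse_rows(text, null_token="NA"):
    """Split comma-separated text into dicts, mapping the null token to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")
    return [
        {key: (None if value == null_token else value) for key, value in row.items()}
        for row in reader
    ]


rows = parse_rows(SAMPLE)
```

Here the second record's `delay` becomes `None` rather than the literal string 'NA', which is the behaviour a CSV reader needs to be told about when missing values are encoded this way.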

Data dictionary:

mon — month (integer between 1 and 12)
dom — day of month (integer between 1 and 31)
dow — day of week (integer; 1 = Monday and 7 = Sunday)
carrier — carrier (IATA code)
flight — flight number
org — origin airport (IATA code)
mile — distance (miles)
depart — departure time (decimal hour)
duration — expected duration (minutes)
delay — delay (minutes)
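
The depart column is described as a decimal hour. Assuming this means hours since midnight with the minutes expressed as a fraction (so 9.5 would be 09:30), a small conversion might look like the sketch below. The helper name `decimal_hour_to_hhmm` is hypothetical and not part of the exercise.

```python
def decimal_hour_to_hhmm(depart):
    """Convert a decimal hour such as 9.5 into an 'HH:MM' string."""
    # Work in whole minutes to avoid floating-point artefacts in the fraction.
    total_minutes = round(depart * 60)
    hours, minutes = divmod(total_minutes, 60)
    return "%02d:%02d" % (hours, minutes)
```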

pyspark has been imported for you and the session has been initialized.

Note: The data have been aggressively down-sampled.

This exercise is part of the course Machine Learning with PySpark.


Exercise instructions

Read data from a CSV file called flights.csv. Assign data types to columns automatically. Deal with missing data.
How many records are in the data?
Take a look at the first five records.
What data types have been assigned to the columns? Do these look correct?

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Read data from CSV file
flights = spark.____.____(____,
                         sep=____,
                         header=____,
                         inferSchema=____,
                         nullValue=____)

# Get number of records
print("The data contain %d records." % flights.____())

# View the first five records
flights.____(5)

# Check column data types
print(flights.____)
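
One way the blanks above might be filled in is sketched below. It is a possible approach rather than the course's reference solution. It assumes `spark` is the already-initialized session mentioned earlier and that the first row of flights.csv holds the column names; the helper names `load_flights` and `describe_flights` are hypothetical.

```python
def load_flights(spark, path="flights.csv"):
    """Read the flights CSV, inferring column types and treating 'NA' as null."""
    return spark.read.csv(
        path,
        sep=",",           # fields are comma-separated (also the default)
        header=True,       # assumption: the first row holds the column names
        inferSchema=True,  # let Spark assign column data types automatically
        nullValue="NA",    # the string 'NA' marks missing data
    )


def describe_flights(flights):
    """Print the record count, the first five records and the column types."""
    print("The data contain %d records." % flights.count())
    flights.show(5)
    print(flights.dtypes)
```

In the exercise environment this would be used as `flights = load_flights(spark)` followed by `describe_flights(flights)`. The `dtypes` attribute lists (column name, type) pairs, which makes it easy to compare the inferred types against the data dictionary.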
