Loading flights data

In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50 000 records. A larger dataset in the same format is also available.

Notes on CSV format:

  • fields are separated by a comma (this is the default separator) and
  • missing data are denoted by the string 'NA'.
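To make the two format notes concrete, here is a minimal sketch in plain Python (no Spark needed) that parses a comma-separated sample and turns the string 'NA' into a missing value. The sample rows and the helper name `parse_rows` are made up for illustration; they are not taken from the real flights file.

```python
import csv
import io

# Made-up sample in the column layout described on this page (not real records).
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
2,22,2,UA,1107,ORD,316,16.33,82,30
"""


def parse_rows(text, missing="NA"):
    """Split fields on commas and map the `missing` marker to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


rows = parse_rows(SAMPLE)
print(rows[0]["delay"])  # missing value, parsed from 'NA'
print(rows[1]["delay"])  # still a string here; no type inference is done
```

Spark's CSV reader can do the same substitution while loading, in addition to inferring column types, which is what the exercise asks for.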

Data dictionary:

  • mon — month (integer between 1 and 12)
  • dom — day of month (integer between 1 and 31)
  • dow — day of week (integer; 1 = Monday and 7 = Sunday)
  • carrier — carrier (IATA code)
  • flight — flight number
  • org — origin airport (IATA code)
  • mile — distance (miles)
  • depart — departure time (decimal hour)
  • duration — expected duration (minutes)
  • delay — delay (minutes)
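The depart column stores time as a decimal hour, so a value such as 16.5 means 16:30 rather than 16:50. The helper below is a small sketch of that conversion; the function name `decimal_hour_to_hhmm` is hypothetical and the input values are invented for illustration.

```python
def decimal_hour_to_hhmm(hour):
    """Convert a decimal hour (e.g. 16.5) to an 'HH:MM' string."""
    # Round to the nearest minute and wrap around at midnight.
    total_minutes = round(hour * 60) % (24 * 60)
    hours, minutes = divmod(total_minutes, 60)
    return "%02d:%02d" % (hours, minutes)


print(decimal_hour_to_hhmm(16.5))
print(decimal_hour_to_hhmm(9.48))
```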

pyspark has been imported for you and the session has been initialized; it is available as spark.

Note: The data have been aggressively down-sampled.

This exercise is part of the course

Machine Learning with PySpark


Exercise instructions

  • Read data from a CSV file called flights.csv. Assign data types to columns automatically. Treat the string 'NA' as missing data.
  • How many records are in the data?
  • Take a look at the first five records.
  • What data types have been assigned to the columns? Do these look correct?

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Read data from CSV file
flights = spark.____.____(____,
                         sep=____,
                         header=____,
                         inferSchema=____,
                         nullValue=____)

# Get number of records
print("The data contain %d records." % flights.____())

# View the first five records
flights.____(5)

# Check column data types
print(flights.____)