Loading flights data
In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format here.
Notes on CSV format:
- fields are separated by a comma (this is the default separator) and
- missing data are denoted by the string 'NA'.
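To make the two format notes concrete, here is a minimal sketch using only Python's standard library rather than Spark. The sample rows are invented for illustration, and `read_rows` is a hypothetical helper, not part of the exercise.

```python
import csv
import io

# Hypothetical sample in the format described above (values invented for illustration).
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
2,22,2,UA,1107,ORD,316,16.33,82,30
"""


def read_rows(text, sep=",", null_value="NA"):
    """Parse delimited text into dicts, mapping the null marker to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return [
        {name: (None if value == null_value else value) for name, value in row.items()}
        for row in reader
    ]


rows = read_rows(SAMPLE)
# Count the records whose delay field held the 'NA' marker.
missing_delay = sum(1 for row in rows if row["delay"] is None)
print(len(rows), "rows,", missing_delay, "with a missing delay")
```

The `sep` and `null_value` parameters here play the same role as the separator and missing-value settings you will supply when reading the file in the exercise.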
Data dictionary:
- mon: month (integer between 1 and 12)
- dom: day of month (integer between 1 and 31)
- dow: day of week (integer; 1 = Monday and 7 = Sunday)
- carrier: carrier (IATA code)
- flight: flight number
- org: origin airport (IATA code)
- mile: distance (miles)
- depart: departure time (decimal hour)
- duration: expected duration (minutes)
- delay: delay (minutes)
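As a reference point for checking column types later, the data dictionary above can be written down as a name-to-type mapping. The sketch below is plain Python; the type names are assumptions read off the descriptions (for example, treating the flight number as an integer), and `to_ddl` is a hypothetical helper.

```python
# Column types are assumptions based on the data dictionary, not confirmed by the source.
COLUMN_TYPES = {
    "mon": "INT",        # month, 1 to 12
    "dom": "INT",        # day of month, 1 to 31
    "dow": "INT",        # day of week, 1 = Monday
    "carrier": "STRING", # IATA carrier code
    "flight": "INT",     # flight number (assumed numeric)
    "org": "STRING",     # IATA origin airport code
    "mile": "INT",       # distance in miles
    "depart": "DOUBLE",  # departure time as a decimal hour
    "duration": "INT",   # expected duration in minutes
    "delay": "INT",      # delay in minutes, may be missing
}


def to_ddl(column_types):
    """Join name/type pairs into a DDL-style schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in column_types.items())


schema_ddl = to_ddl(COLUMN_TYPES)
print(schema_ddl)
```

PySpark's CSV reader accepts a DDL-formatted string of this shape through its `schema` argument, which is an alternative to inferring types automatically. This exercise uses inference, so the mapping serves only as a checklist for comparing against the inferred types.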
pyspark has been imported for you and the session has been initialized.
Note: The data have been aggressively down-sampled.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Read data from a CSV file called flights.csv. Assign data types to columns automatically. Deal with missing data.
- How many records are in the data?
- Take a look at the first five records.
- What data types have been assigned to the columns? Do these look correct?
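Why handling missing data matters for the assigned types can be shown with a simplified stand-in for schema inference. This is plain Python, not Spark's actual algorithm; `infer_type` is a hypothetical helper and the sample values are invented.

```python
def infer_type(values, null_value=None):
    """Return 'int', 'double' or 'string' for a column of raw strings."""
    # Ignore entries that match the declared null marker.
    present = [v for v in values if v != null_value]
    # Try the narrowest type first and fall back to wider ones.
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


delay = ["30", "NA", "-8"]  # invented values for illustration

# Without a null marker, 'NA' cannot be parsed as a number.
print(infer_type(delay))
# With 'NA' declared as missing, the remaining values are all integers.
print(infer_type(delay, null_value="NA"))
```

In this sketch a single undeclared 'NA' is enough to turn a numeric column into a string column, which is the kind of mismatch to look for when you check the assigned data types.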
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Read data from CSV file
flights = spark.____.____(____,
                          sep=____,
                          header=____,
                          inferSchema=____,
                          nullValue=____)
# Get number of records
print("The data contain %d records." % flights.____())
# View the first five records
flights.____(5)
# Check column data types
print(flights.____)