Loading flights data
In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format here.
Notes on CSV format:
- fields are separated by a comma (this is the default separator) and
- missing data are denoted by the string 'NA'.
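To make these two conventions concrete, here is a minimal plain-Python sketch. It uses the standard library `csv` module, not the PySpark reader that the exercise asks for. The sample rows and the helper name `parse_rows` are invented for illustration and are not taken from `flights.csv`.

```python
import csv
import io

# Hypothetical sample in the same column layout as the data dictionary;
# the values are invented, not taken from flights.csv.
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
3,22,2,UA,1107,ORD,316,16.33,82,30
"""

def parse_rows(text, sep=",", null_value="NA"):
    """Parse delimited text, mapping the missing-data marker to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return [
        {key: (None if value == null_value else value) for key, value in row.items()}
        for row in reader
    ]

rows = parse_rows(SAMPLE)
print(rows[0]["delay"])  # the 'NA' marker was mapped to None
print(rows[1]["delay"])  # still a string here; no type inference is done
```

Note that this sketch leaves every present value as a string. Assigning column types automatically, as the exercise requires, is a separate step that the CSV reader has to be asked to perform.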
Data dictionary:
- mon: month (integer between 1 and 12)
- dom: day of month (integer between 1 and 31)
- dow: day of week (integer; 1 = Monday and 7 = Sunday)
- carrier: carrier (IATA code)
- flight: flight number
- org: origin airport (IATA code)
- mile: distance (miles)
- depart: departure time (decimal hour)
- duration: expected duration (minutes)
- delay: delay (minutes)
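The depart field is stored as a decimal hour, which is easy to misread as a clock time. The sketch below assumes the usual meaning (whole hours plus a fraction of an hour, so 9.5 means half past nine); the helper name `decimal_hour_to_hhmm` is hypothetical and is not part of the exercise.

```python
def decimal_hour_to_hhmm(depart):
    """Convert a decimal hour (e.g. 9.5) to an 'HH:MM' string."""
    # Work in whole minutes so the fractional hour is rounded only once.
    total_minutes = round(depart * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"

print(decimal_hour_to_hhmm(9.5))    # 09:30
print(decimal_hour_to_hhmm(16.25))  # 16:15
```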
pyspark has been imported for you and the session has been initialized.
Note: The data have been aggressively down-sampled.
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Read data from a CSV file called flights.csv. Assign data types to columns automatically. Deal with missing data.
- How many records are in the data?
- Take a look at the first five records.
- What data types have been assigned to the columns? Do these look correct?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Read data from CSV file
flights = spark.____.____(____,
                          sep=____,
                          header=____,
                          inferSchema=____,
                          nullValue=____)
# Get number of records
print("The data contain %d records." % flights.____())
# View the first five records
flights.____(5)
# Check column data types
print(flights.____)