Removing columns and rows
You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.
In this exercise you need to trim those data down by:
- removing an uninformative column and
- removing rows which do not have information about whether or not a flight was delayed.
The data are available as flights.
Note: You might find it useful to revise the slides from the lessons in the Slides panel next to the IPython Shell.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Remove the flight column.
- Find out how many records have missing values in the delay column.
- Remove records with missing values in the delay column.
- Remove records with missing values in any column and get the number of remaining rows.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Remove the 'flight' column
flights_drop_column = flights.____(____)
# Number of records with missing 'delay' values
flights_drop_column.____('delay IS NULL').____()
# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.____(____)
# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.____()
print(flights_none_missing.____())