Removing columns and rows

You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

  1. removing an uninformative column and
  2. removing rows which do not have information about whether or not a flight was delayed.

The data are available in a DataFrame named flights.

Note: You might find it useful to revise the slides from the lessons in the Slides panel next to the IPython Shell.

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Remove the flight column.
  • Find out how many records have missing values in the delay column.
  • Remove records with missing values in the delay column.
  • Remove records with missing values in any column and get the number of remaining rows.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Remove the 'flight' column
flights_drop_column = flights.____(____)

# Number of records with missing 'delay' values
flights_drop_column.____('delay IS NULL').____()

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.____(____)

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.____()
print(flights_none_missing.____())