
Quick pipeline

Before you parse more complex data, your manager would like to see a simple pipeline example covering the basic steps. For this example, you'll ingest a data file, filter a few rows, add an ID column, and write the result out as JSON.

The Spark session is available as spark, and the pyspark.sql.functions library has been imported under its customary alias, F.
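If you're working outside the exercise environment, a minimal setup sketch might look like the following; the app name is an arbitrary placeholder, and the exercise already provides spark and F for you.

# Minimal local setup sketch; the exercise environment provides these objects
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# 'quick_pipeline' is an arbitrary placeholder app name
spark = SparkSession.builder.appName('quick_pipeline').getOrCreate()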

This exercise is part of the course Cleaning Data with PySpark.

Exercise instructions

  • Import the file 2015-departures.csv.gz into a DataFrame. Note that the file already includes a header row.
  • Filter the DataFrame so it contains only flights with a duration over 0 minutes. Use the index of the column, not the column name (use .printSchema() to check the column names and their order).
  • Add an ID column (see the sketch after this list for one common way to generate IDs).
  • Write the result out as a JSON document named output.json.
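The instructions don't name a function for generating the IDs; one common PySpark choice is F.monotonically_increasing_id(), which assigns each row a unique 64-bit integer (increasing, but not guaranteed consecutive). A minimal sketch:

# Sketch: add a unique ID per row (IDs increase but may not be consecutive)
departures_df = departures_df.withColumn('id', F.monotonically_increasing_id())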

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the data to a DataFrame
departures_df = spark.____(____, header=____)

# Remove any duration of 0
departures_df = departures_df.____(____)

# Add an ID column
departures_df = departures_df.____('id', ____)

# Write the file out to JSON format
____.write.____(____, mode='overwrite')
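For reference, one possible completion of the template is sketched below; it's an illustrative solution, not necessarily the official one. It assumes the duration column sits at index 3, so confirm the actual position with departures_df.printSchema() before relying on it.

# Import the data to a DataFrame (the file includes a header row)
departures_df = spark.read.csv('2015-departures.csv.gz', header=True)

# Remove any rows with a duration of 0 (duration assumed at column index 3)
departures_df = departures_df.filter(departures_df[3] > 0)

# Add a unique ID column
departures_df = departures_df.withColumn('id', F.monotonically_increasing_id())

# Write the result out as a JSON document
departures_df.write.json('output.json', mode='overwrite')

Note that Spark writes output.json as a directory of partitioned files rather than a single file; that's expected behavior for distributed writes.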