Aan de slagGa gratis aan de slag

Quick pipeline

Before you parse some more complex data, your manager would like to see a simple pipeline example including the basic steps. For this example, you'll want to ingest a data file, filter a few rows, add an ID column to it, then write it out as JSON data.

The spark context is defined, along with the pyspark.sql.functions library being aliased as F as is customary.

Deze oefening maakt deel uit van de cursus

Cleaning Data with PySpark

Cursus bekijken

Oefeninstructies

  • Import the file 2015-departures.csv.gz to a DataFrame. Note the header is already defined.
  • Filter the DataFrame to contain only flights with a duration over 0 minutes. Use the index of the column, not the column name (remember to use .printSchema() to see the column names / order).
  • Add an ID column.
  • Write the file out as a JSON document named output.json.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Import the data to a DataFrame
departures_df = spark.____(____, header=____)

# Remove any duration of 0
departures_df = departures_df.____(____)

# Add an ID column
departures_df = departures_df.____('id', ____)

# Write the file out to JSON format
____.write.____(____, mode='overwrite')
Code bewerken en uitvoeren