Using lazy processing

Lazy processing operations usually return in about the same amount of time regardless of the actual quantity of data. This is because Spark does not perform any transformations until an action is requested.
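
As a rough analogy only (this is plain Python, not Spark), a generator behaves in a similar way: building the pipeline is nearly instant, and the cost appears only when the results are consumed. The airport codes and the artificial per-row delay below are hypothetical stand-ins, chosen just to make the timing gap visible.

```python
import time


def slow_lower(values):
    """Lazily lowercase each value, simulating some per-row cost."""
    for value in values:
        time.sleep(0.005)  # hypothetical stand-in for real per-row work
        yield value.lower()


# Hypothetical sample data, not taken from the exercise file
airports = ["DFW", "ORD", "LAX", "JFK", "SEA"] * 10

# "Transformation": only builds the generator, no rows are processed yet
start = time.perf_counter()
pipeline = slow_lower(airports)
define_time = time.perf_counter() - start

# "Action": consuming the generator forces all of the work to happen
start = time.perf_counter()
result = list(pipeline)
action_time = time.perf_counter() - start

print(f"define: {define_time:.4f} s, action: {action_time:.4f} s")
```

Spark's execution model is considerably more involved than a generator, but the same pattern applies: defining work is cheap, and requesting results is where the time goes.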

For this exercise, we'll define a Data Frame (aa_dfw_df) and add a couple of transformations. Note the amount of time required for the transformations to complete when they are defined versus when the data is actually queried. The differences may be small, but they will be noticeable. When working with a full Spark cluster and larger quantities of data, the difference will be more apparent.

This exercise is part of the course

Cleaning Data with PySpark

Exercise instructions

  • Load the Data Frame.
  • Add the transformation for F.lower() to the Destination Airport column.
  • Show the Data Frame, noting the time difference for this action to complete.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Load the CSV file
aa_dfw_df = ____.____.____('csv').options(Header=True).load('AA_DFW_2018.csv.gz')

# Add the airport column using the F.lower() method
aa_dfw_df = aa_dfw_df.withColumn('airport', ____(aa_dfw_df['Destination Airport']))

# Show the DataFrame
____
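
One plausible way to complete and time the template above is sketched here. It assumes PySpark is installed and that AA_DFW_2018.csv.gz is in the working directory. The timed helper, the run_exercise wrapper, and the "lazy_demo" application name are hypothetical additions for illustration. In the course environment a spark session is presumably already defined, so the sketch creates its own.

```python
import time


def timed(label, fn):
    """Call fn(), print the elapsed wall-clock time, and return fn's result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result


def run_exercise(path="AA_DFW_2018.csv.gz"):
    # Imported here so the timing helper above can be used without Spark
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    # Assumption: no session exists yet, so create (or reuse) one
    spark = SparkSession.builder.appName("lazy_demo").getOrCreate()

    # Define the DataFrame; Spark may read the header line here,
    # but it does not process the full data set
    df = timed(
        "load (definition)",
        lambda: spark.read.format("csv").options(Header=True).load(path),
    )

    # Add a transformation; this only extends the query plan
    df = timed(
        "withColumn (transformation)",
        lambda: df.withColumn("airport", F.lower(df["Destination Airport"])),
    )

    # Action: Spark now actually reads rows and applies the transformation
    timed("show (action)", df.show)
    return df
```

Calling run_exercise() prints three timings. The show() step is the one that triggers real processing, so it is where the extra time would be expected to appear.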