Pandafy a Spark DataFrame

Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.

Sometimes it makes sense to then take that table and work with it locally using a tool like pandas. Spark DataFrames make that easy with the .toPandas() method. Calling this method on a Spark DataFrame returns the corresponding pandas DataFrame. Because every row is collected onto the driver, this works best once the result is small enough to fit in local memory.

This time the query counts the number of flights to each airport from SEA and PDX.

Remember, there's already a SparkSession called spark in your workspace!

This exercise is part of the course

Foundations of PySpark

Exercise instructions

  • Run the query using the .sql() method. Save the result in flight_counts.
  • Use the .toPandas() method on flight_counts to create a pandas DataFrame called pd_counts.
  • Print the .head() of pd_counts to the console.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Don't change this query
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"

# Run the query
flight_counts = ____

# Convert the results to a pandas DataFrame
pd_counts = ____

# Print the head of pd_counts
print(____)
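
One way the blanks above might be filled in is sketched below. It is not the course's official solution. The helper names flight_counts_to_pandas and demo are hypothetical, and the toy rows in demo are invented only so the sketch can run outside the exercise workspace, where spark and the flights table already exist.

```python
# Same query text as in the exercise template.
QUERY = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"


def flight_counts_to_pandas(spark, query=QUERY):
    """Hypothetical helper: run `query` on a SparkSession, return a pandas DataFrame."""
    # .sql() returns a Spark DataFrame; the data stays distributed.
    flight_counts = spark.sql(query)
    # .toPandas() collects the rows to the driver as a local pandas DataFrame.
    return flight_counts.toPandas()


def demo():
    """Assumption: a local Spark install; builds a tiny stand-in `flights` table."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandafy-demo")
        .getOrCreate()
    )
    # Invented rows, for illustration only.
    rows = [("SEA", "SFO"), ("SEA", "SFO"), ("PDX", "SFO"), ("SEA", "LAX")]
    spark.createDataFrame(rows, ["origin", "dest"]).createOrReplaceTempView("flights")

    pd_counts = flight_counts_to_pandas(spark)
    print(pd_counts.head())
    spark.stop()
    return pd_counts
```

Inside the exercise workspace the equivalent calls would be spark.sql(query), then flight_counts.toPandas(), then print(pd_counts.head()). Calling demo() is only needed when trying the sketch on your own machine.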