
IDs with different partitions

You've just completed adding an ID field to a DataFrame. Now, take a look at what happens when you do the same thing on DataFrames containing a different number of partitions.
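To see why the partition count can matter, here is a minimal pure-Python sketch. It assumes the ID column is produced by Spark's `monotonically_increasing_id()`, which the exercise text does not name explicitly. Spark's documentation describes that function as placing the partition index in the upper 31 bits and the per-partition record number in the lower 33 bits. The helper names `simulated_row_id` and `simulated_ids` are hypothetical and only mimic that layout; they are not part of PySpark.

```python
# Simulates the documented ID layout of monotonically_increasing_id():
# partition index in the upper 31 bits, record number in the lower 33 bits.
RECORD_BITS = 33


def simulated_row_id(partition_index, record_number):
    """Combine a partition index and a record number into one 64-bit ID."""
    return (partition_index << RECORD_BITS) + record_number


def simulated_ids(rows_per_partition):
    """Return the IDs for a dataset whose partitions hold the given row counts."""
    return [
        simulated_row_id(partition, record)
        for partition, row_count in enumerate(rows_per_partition)
        for record in range(row_count)
    ]


# Six rows in a single partition versus six rows spread over three partitions.
single = simulated_ids([6])
multi = simulated_ids([2, 2, 2])

print(single)  # [0, 1, 2, 3, 4, 5]
print(multi)   # [0, 1, 8589934592, 8589934593, 17179869184, 17179869185]
```

Under this layout, a single-partition dataset gets consecutive IDs starting at 0, while a multi-partition dataset gets IDs that jump by 2^33 at each partition boundary. The IDs remain unique and increasing, but they are not consecutive.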

To check the number of partitions, use the method .rdd.getNumPartitions() on a DataFrame.

The spark session and two DataFrames, voter_df and voter_df_single, are available in your workspace. The instructions will help you discover the difference between the DataFrames. The pyspark.sql.functions library is available under the alias F.

This exercise is part of the course Cleaning Data with PySpark.

Exercise instructions

  • Print the number of partitions on each DataFrame.
  • Add a ROW_ID field to each DataFrame.
  • Show the top 10 IDs in each DataFrame.

Hands-on interactive exercise

Complete the sample code below to finish this exercise.

# Print the number of partitions in each DataFrame
print("\nThere are %d partitions in the voter_df DataFrame.\n" % ____)
print("\nThere are %d partitions in the voter_df_single DataFrame.\n" % ____)

# Add a ROW_ID field to each DataFrame
voter_df = voter_df.____('ROW_ID', ____)
voter_df_single = ____

# Show the top 10 IDs in each DataFrame 
voter_df.____(voter_df.____.desc()).show(____)
____.orderBy(____).show(10)