
Aggregating in RDDs

Now that you have conducted analytics with DataFrames in PySpark, let's briefly do a similar task with an RDD. Using the provided code, compute the sum of the salary values in an RDD, grouped by department.

A Spark session called spark has already been created for you.
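
If you are following along outside the exercise environment, here is a minimal sketch for creating the session yourself (the application name is just an illustrative placeholder):

from pyspark.sql import SparkSession

# Minimal sketch: only needed when no session is provided for you
spark = SparkSession.builder.appName("rdd-aggregation").getOrCreate()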

This exercise is part of the course

Introduction to PySpark


Instructions

  • Create an RDD of (key, value) pairs from the provided DataFrame.
  • Use the provided lambda function to reduce the values of the RDD by key (see the sketch after this list).
  • Collect the results of the aggregation.
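
Before filling in the blanks, it may help to see key-based aggregation on a standalone toy RDD where a key actually repeats. The data and variable names below are illustrative only and not part of the exercise:

# Illustrative only: reduceByKey merges the values of equal keys pairwise
pairs = spark.sparkContext.parallelize([("HR", 3000), ("HR", 1200), ("IT", 4000)])
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())  # e.g. [('HR', 4200), ('IT', 4000)]; order is not guaranteed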

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# DataFrame creation (salaries as integers so they sum numerically)
data = [("HR", 3000), ("IT", 4000), ("Finance", 3500)]
columns = ["Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)

# Convert the DataFrame to an RDD of (Department, Salary) pairs
rdd = df.rdd.____(lambda row: (row["Department"], row["Salary"]))

# Apply a lambda function to sum the salaries for each department
rdd_aggregated = rdd.____(lambda x, y: x + y)

# Show the collected results
print(rdd_aggregated.____())
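
For reference, one possible way to fill in the blanks, assuming map, reduceByKey, and collect are the intended calls:

# One possible solution sketch (assumes map / reduceByKey / collect)
rdd = df.rdd.map(lambda row: (row["Department"], row["Salary"]))
rdd_aggregated = rdd.reduceByKey(lambda x, y: x + y)
print(rdd_aggregated.collect())  # e.g. [('HR', 3000), ('IT', 4000), ('Finance', 3500)]

Note that because each department appears only once in this data, the reducing lambda is never actually invoked, so each total equals the original salary; the order of the collected tuples is not guaranteed.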