
RDDs from Parallelized collections

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, distributed collection of objects. Since the RDD is the fundamental data type underpinning Spark, it is important that you understand how to create one. In this exercise, you'll create your first RDD in PySpark from a collection of words.

Remember, you already have a SparkContext sc available in your workspace.

This exercise is part of the course

Big Data Fundamentals with PySpark

Exercise instructions

  • Create an RDD named RDD from a Python list of words.
  • Confirm that the object created is an RDD.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Create an RDD from a list of words
RDD = sc.____(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", ____(RDD))