Advanced RDD Actions

1. More actions

Previously you learned about advanced RDD Transformations for key/value datasets. Similar to advanced RDD Transformations there are advanced RDD Actions which you'll see in this video.

2. reduce() action

Reduce action takes in a function which operates on two elements of the same type of RDD and returns a new element of the same type. The function should be commutative and associative so that it can be computed correctly in parallel. A simple example of such a function is +, which we can use to sum our RDD. Here is an example of reduce action that calculates the sum of all the elements in an RDD. In this example, input RDD is first created using SparkContext's parallelize method on a list consisting of numbers 1,3,4,6. Eexcuting reduce action results in 14 which is the sum of 1,3,4,6.

3. saveAsTextFile() action

In many cases, it is not advisable to run collect action on RDDs because of the huge size of the data. In these cases, it’s common to write data out to a distributed storage systems such as HDFS or Amazon S3. saveAsTextFile action can be used to save RDD as a text file inside a particular directory. By default, saveAsTextFile saves RDD with each partition as a separate file inside a directory. Here is an example of saveAsTextFile that saves an RDD with each partition as a separate file inside a directory. However, you can change it to return a new RDD that is reduced into a single partition using the coalesce method. Here is an example of saveAsTextFile that saves RDD as a single file inside a directory. Similar to

4. Action Operations on pair RDDs

pair RDD Transformations, there are also RDD Actions available for pair RDDs. However, pair RDDs also attain some additional actions of PySpark especially those that leverage the advantage of data which is of key-value nature. Let’s take a look at two pair RDD actions - countByKey and collectAsMap in this video.

5. countByKey() action

countByKey is only available on RDDs of type (Key, Value). With the countByKey operation, we can count the number of elements for each key. Here is an example of counting the number of values for each key in the dataset. In this example, we first create a pair RDD named rdd using SparkContext's parallelize method. Since countByKey generates a dictionary, next we iterate over the dictionary to print the each unique and number of values associated with each key as shown here. One thing to note is that countByKey should only be used on a dataset whose size is small enough to fit in memory. collectAsMap

6. collectAsMap() action

returns the key-value pairs in the RDD to the as a dictionary. Here is an example of collectAsMap on a pair RDD. As before we create a pair RDD using SparkContext's parallelize method and next use collectAsMap action. collectAsMap produces the key-value pairs in the RDD as a dictionary which can be used for downstream analysis. Similar to countByKey, this action should only be used if the resulting data is expected to be small, as all the data is loaded into the memory. Let's practice

7. Let's practice

some of these advanced Actions on some test data in PySpark shell.