1. Parallel computation frameworks
Hi again. Now that we've seen the basics of parallel computing, it's time to talk about specific parallel computing frameworks. We'll focus on the frameworks that are currently hot in the data engineering world.
2. Hadoop
Before we dive into state-of-the-art big data processing tools, let's talk a bit about the foundations. If you've done some investigation into big data ecosystems, you've probably heard about Hadoop. Hadoop is a collection of open source projects, maintained by the Apache Software Foundation. Some of them are a bit outdated, but they're still relevant to talk about. There are two Hadoop projects we want to focus on for this video: MapReduce and HDFS.
3. HDFS
First, HDFS is a distributed file system. It's similar to the file system you have on your computer, the only difference being that the files reside on multiple computers. HDFS has been essential in the big data world, and for parallel computing by extension. Nowadays, cloud-managed storage systems like Amazon S3 often replace HDFS.
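As a data engineer you often interact with HDFS through client libraries, and a small sketch can make the idea concrete. Assuming a cluster whose namenode is reachable at "namenode" and a directory /data (both hypothetical names), listing files from Python with the pyarrow library might look like this; it requires the Hadoop client libraries to be installed:

    from pyarrow import fs

    # Connect to the cluster's namenode (host and port are assumptions).
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

    # List the files in a directory, much like on a local file system,
    # even though the underlying blocks live on many machines.
    for info in hdfs.get_file_info(fs.FileSelector("/data")):
        print(info.path, info.size)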
4. MapReduce
Second, MapReduce was one of the first popularized big data processing paradigms. It works similarly to the example you saw in the previous video: the program splits tasks into subtasks, distributing the workload and data between several processing units. For MapReduce, these processing units are the computers in the cluster. MapReduce had its flaws; one of them was that MapReduce jobs were hard to write. Many software packages popped up to address this problem, and one of them was Hive.
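To make the paradigm concrete, here is a toy, single-machine sketch of the two steps MapReduce is named after: a map step that turns each record into key-value pairs, and a reduce step that combines the values per key. Real MapReduce runs these steps distributed over the cluster; the word-count data here is purely illustrative.

    from collections import defaultdict

    records = ["berlin london", "london paris", "berlin berlin"]

    # Map step: emit a (word, 1) pair for every word in every record.
    mapped = [(word, 1) for record in records for word in record.split()]

    # Shuffle step: group the emitted pairs by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce step: combine the values for each key into one result.
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)  # {'berlin': 3, 'london': 2, 'paris': 1}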
5. Hive
Hive is a layer on top of the Hadoop ecosystem that makes data from several sources queryable in a structured way using Hive's SQL variant: HiveQL. Facebook initially developed Hive, but the Apache Software Foundation now maintains the project. Although MapReduce was initially responsible for running Hive jobs, Hive now integrates with several other data processing tools.
6. Hive: an example
Let's look at an example: this Hive query selects the average age of the Olympians per year they participated. As you'd expect, the query looks just like a regular SQL query. Behind the curtains, however, it is transformed into a job that can operate on a cluster of computers.
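The exact query from the slide isn't reproduced here, but assuming a table named olympics with columns age and year (hypothetical names), a HiveQL query along these lines would do the job:

    SELECT year, AVG(age)
    FROM olympics
    GROUP BY year;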
7. Spark
The other parallel computation framework we'll introduce is called Spark. Spark distributes data processing tasks between clusters of computers. While MapReduce-based systems tend to need expensive disk writes between jobs, Spark tries to keep as much processing as possible in memory. In that sense, Spark was also an answer to the limitations of MapReduce. The disk writes of MapReduce were especially limiting in interactive exploratory data analysis, where each step builds on top of a previous one. Spark originated at the University of California, Berkeley's AMPLab. Currently, the Apache Software Foundation maintains the project.
8. Resilient distributed datasets (RDD)
Spark's architecture relies on something called resilient distributed datasets, or RDDs. Without diving into technicalities, this is a data structure that maintains data distributed across multiple nodes. Unlike DataFrames, RDDs don't have named columns. Conceptually, you can think of RDDs as lists of tuples. We can do two types of operations on these data structures: transformations, like map or filter, and actions, like count or first. Transformations return new, transformed RDDs, while actions yield a single result, as the sketch below shows.
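Here is a minimal PySpark sketch of that difference; the (year, age) tuples are made-up data, and running it locally assumes a working Spark installation:

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-demo")

    # An RDD of (year, age) tuples; no named columns, just tuples.
    pairs = sc.parallelize([(2012, 24), (2012, 30), (2016, 28)])

    # Transformations (map, filter) return a new RDD; nothing runs yet.
    ages = pairs.map(lambda pair: pair[1]).filter(lambda age: age > 25)

    # Actions (count, first) trigger the computation and return a value.
    print(ages.count())  # 2
    print(ages.first())  # 30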
9. PySpark
When working with Spark, people typically use a programming language interface like PySpark. PySpark is the Python interface to Spark. There are interfaces to Spark in other languages as well, like R or Scala. PySpark provides a DataFrame abstraction, which means you can do operations very similar to those on pandas DataFrames. PySpark and Spark take care of all the complex parallel computing operations.
10. PySpark: an example
Have a look at the following PySpark example. Similar to the Hive query you saw before, it calculates the mean age of the Olympians per year of the Olympic event. Instead of using the SQL abstraction, like in the Hive example, it uses the DataFrame abstraction.
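The slide's code isn't reproduced here, but a minimal PySpark version of that calculation could look like the following; the tiny inline dataset and the column names Year and Age are assumptions standing in for the Olympians data.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A tiny stand-in for the Olympians dataset.
    athlete_events = spark.createDataFrame(
        [(2012, 24), (2012, 30), (2016, 28)], ["Year", "Age"]
    )

    # DataFrame abstraction: group by year, then average the ages.
    mean_age_per_year = athlete_events.groupBy("Year").mean("Age")
    mean_age_per_year.show()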
11. Let's practice!
Let's try this in the exercises!