What is Big Data?
1. Fundamentals of Big Data
Welcome to the first video of the Big Data fundamentals via PySpark course. My name is Upendra Devisetty and I am a Science Analyst at CyVerse. Let's get started.
2. What is Big Data?
What exactly is Big Data? There is no single definition of Big Data because projects, vendors, practitioners, and business professionals use the term quite differently. According to Wikipedia, Big Data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing software.
3. The 3 V's of Big Data
There are three Vs of Big Data that are used to describe its characteristics: volume, velocity, and variety. Volume refers to the size of the data. Variety refers to the different sources and formats of data. Velocity is the speed at which data is generated and made available for processing.
4. Big Data concepts and Terminology
Now let's take a look at some of the concepts and terminology of Big Data. Clustered computing is the pooling of the resources of multiple machines to complete jobs. Parallel computing is a type of computation in which many calculations are carried out simultaneously. Distributed computing involves nodes, or networked computers, that run jobs in parallel. Batch processing refers to breaking data into smaller pieces and running each piece on an individual machine. Real-time processing demands that information is processed and made ready immediately.
5. Big Data processing systems
There are two popular frameworks for Big Data processing. The first is the highly successful Hadoop/MapReduce framework, which is an open source and scalable framework for batch data. The second is the popular Apache Spark, which is a parallel framework for storing and processing Big Data across clustered computers. It is also open source and is suited for both batch and real-time data processing. In this course, you'll learn about Apache Spark.
6. Features of Apache Spark framework
Let's talk about the main features of Apache Spark. Spark distributes data and computation across multiple computers, executing complex multi-stage applications such as machine learning. Spark runs most computations in memory and thereby provides better performance for applications such as interactive data mining. Spark helps to run an application up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster when running on disk. Spark is mainly written in Scala but also has support for Java, Python, R, and SQL.
7. Apache Spark Components
Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features like machine learning, real-time stream processing, and graph computations. At the center of the ecosystem is Spark Core, which contains the basic functionality of Spark. The rest of Spark's libraries are built on top of it. The first is Spark SQL, which is a library for processing structured and semi-structured data in Python, Java, and Scala. The second is MLlib, which is a library of common machine learning algorithms. The third component is GraphX, which is a collection of algorithms and tools for manipulating graphs and performing parallel graph computations. Finally, Spark Streaming is a scalable, high-throughput processing library for real-time data. In this course, you'll learn about Spark SQL and MLlib.
8. Spark modes of deployment
Spark can be run in two modes. The first is local mode, where you run Spark on a single machine such as your laptop. Local mode is very convenient for testing, debugging, and demonstration purposes. The second is cluster mode, where Spark is run on a cluster. Cluster mode is mainly used for production. The typical development workflow is to start in local mode and then transition to cluster mode. During the transition from local to cluster mode, no code change is necessary. In this course, you'll be using local mode.
9. Coming up next - PySpark
In the next video, you'll learn about PySpark, which is the Python API for Spark.
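As a rough preview of PySpark, the local-to-cluster workflow described earlier could be sketched along the following lines. This is only a minimal sketch, assuming the pyspark package is installed; the environment variable name SPARK_MASTER_URL and the helper names resolve_master and build_session are hypothetical and chosen purely for illustration. The idea is that the master URL comes from configuration, so the same code runs in local mode by default and on a cluster when a cluster URL is supplied.

```python
import os

# Hypothetical environment variable name, used only for this illustration.
MASTER_ENV_VAR = "SPARK_MASTER_URL"


def resolve_master(env=None):
    """Return the Spark master URL, defaulting to local mode.

    "local[*]" asks Spark to run on this machine using all available cores.
    """
    env = os.environ if env is None else env
    return env.get(MASTER_ENV_VAR, "local[*]")


def build_session(app_name="big-data-fundamentals"):
    """Create (or reuse) a SparkSession for the resolved master URL."""
    # Imported lazily so resolve_master still works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(resolve_master())
        .appName(app_name)
        .getOrCreate()
    )
```

With nothing set, calling build_session() would start Spark in local mode. Setting SPARK_MASTER_URL to a cluster address such as spark://host:7077 (a placeholder) would point the same, unchanged code at a cluster. In practice the master is often supplied outside the code instead, for example through the --master option of spark-submit.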