Intro to batch processing
1. Intro to batch processing
Welcome to Streaming Concepts, a course designed to introduce you to methods of processing streams of data. I'm Mike Metzger, a data engineer, and I will be your instructor for this course. Before we can discuss how streaming works, we need to review a more traditional method known as batch processing, which we will cover in this first chapter.
2. What is batch processing?
Let's talk about some characteristics of batch processing. First, batching means we're processing data in groups. We take an amount of data (we'll discuss this more in a moment) and process it together in a batch. Batch processing runs from the start of our task to the end of that task. This means that we don't add new data to the batch once it's started. Typically, we only end when we've successfully completed the batch or encountered a problem. Usually, batches are started on an interval (hourly, daily, monthly, etc.) or as a result of some starting event (such as manually initiating a process). Batches also have an amount of data associated with them, typically referred to as the batch size. In general terms, the batch size is how much data is being processed at that time. This could be 100 KB of a file, or a million images. For our purposes, the concept of a batch size is more important than the specifics. Finally, an instance of a batch process is often called a 'job'. While this is not always the case, it can be a useful shortcut when determining if a data processing task is batch-oriented.
3. Common batch processing scenarios
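As a quick illustration of the batch size and job ideas described above, here is a minimal Python sketch. The names `make_batches` and `run_job` are made up for this example and are not part of any library, and summing the numbers simply stands in for whatever real processing a job would do.

```python
from typing import Iterable, Iterator, List


def make_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group records into lists holding at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than batch_size.
        yield batch


def run_job(records: Iterable[int], batch_size: int) -> List[int]:
    """Hypothetical 'job': process every batch from start to end.

    Returns one total per batch. Summing is a placeholder for real work.
    """
    # Take a snapshot of the input so no new data joins the job once it starts.
    snapshot = list(records)
    return [sum(batch) for batch in make_batches(snapshot, batch_size)]


totals = run_job(range(1, 11), batch_size=4)
print(totals)  # [10, 26, 19]
```

Ten records with a batch size of 4 give two full batches and one smaller final batch, which is why three totals come back.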
Let's take a look at a few common batch computing scenarios that you encounter daily. Anytime our computer reads a file or a portion of a file, this is usually done in a batch fashion. In this case it's often called the chunk size, but it represents the same idea of processing a certain portion of data before moving on to the next. When we send or receive email, this is done in batches. Our email client can check if new messages are available and will update accordingly. The batch size in this case varies based on the number of messages, but it is still done in chunks. In the case of sending messages, we can create several to send at once and store them until we can transmit them (consider writing emails while traveling on a plane). These can then be sent in a single batch when connectivity returns. Another common batch scenario is printing. A specific print job would represent the number of pages to be printed, how many copies, and other characteristics. If you send multiple print jobs, each is completed prior to starting another (otherwise the pages would be interleaved between multiple jobs and make little sense).
4. Why batch?
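The file-reading scenario above, where a fixed chunk size is processed before moving on to the next, can be sketched in Python roughly as follows. This is only an illustration under assumptions: `read_in_chunks` and `chunk_lengths` are hypothetical helper names, the 1024-byte chunk size is arbitrary, and a temporary file is used so the example is self-contained.

```python
import os
import tempfile
from typing import Iterator, List


def read_in_chunks(path: str, chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield the file's contents one chunk (batch) at a time."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # An empty result means we reached the end of the file.
                break
            yield chunk


def chunk_lengths(data: bytes, chunk_size: int) -> List[int]:
    """Write data to a temporary file, read it back in chunks,
    and report the size of each chunk that was read."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return [len(chunk) for chunk in read_in_chunks(path, chunk_size)]
    finally:
        os.remove(path)


print(chunk_lengths(b"x" * 2500, 1024))  # [1024, 1024, 452]
```

A 2500-byte file read with a 1024-byte chunk size is handled as two full chunks plus a shorter final one, the same pattern as a batch job whose last batch is only partly full.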
So why are batch processes commonly used? The idea of batch processes is well understood and fairly easy to explain. Most tasks since the beginning of computing have fallen under batch processing (originally, computer users submitted processing jobs on punch cards to an actual computer administrator). A batch process typically works the same way each time and is usually consistent in its behavior. There are definitely times when it's not the right tool, and we'll discuss other options (like streaming) in the next chapters of this course. Finally, it is reasonably straightforward to improve performance in batch processes. This is known as scaling the process, and we'll discuss this further in the next lesson.
5. Let's practice!
OK, let's make sure you understand what batch processing consists of, as well as its advantages and applications, with a couple of exercises.