1. Read Data in Batches
Data Sources and Data Assets help us create the infrastructure to read our data into Great Expectations, but we access the actual data by creating Batch Definitions. Let's learn how to use them.
2. Kaggle Weather Data
Before we begin, let's introduce the data we'll be working with. For this video and the subsequent exercises, we'll be using the Weather Data dataset from Kaggle. We printed the first and last rows here.
3. Batch Definitions
Now, a Batch Definition specifies a selection of records, called a Batch, from a Data Asset. It describes how the data within a Data Asset should be organized for retrieval.
We can create a Batch Definition using the Data Asset's `.add_batch_definition_whole_dataframe()` method, passing in the desired name of the Batch Definition to the `name` parameter. Notice here that we called the Python variable `batch_definition`, while the `name` parameter is set to the string `"my_batch_definition"`. We discussed this distinction in the last video.
4. Batches
A Batch is a slice of a Data Asset based on a desired specification, defined in our Batch Definition. It represents a group of records that validations can be run on.
We can get a Batch from the Batch Definition using its `.get_batch()` method. We pass to the `batch_parameters` keyword argument a dictionary with the string `"dataframe"` as the key and our pandas DataFrame as the value.
5. Batches
It's worth noting that with other Data Source types, such as Spark or SQL, the data is actually passed into the Data Asset. However, because pandas DataFrames live in memory, we pass them in at runtime to the Batch itself.
6. The Batch object
When we create a Batch Definition from a pandas Data Source, as we did here, the Batch object has some similarities to a pandas DataFrame. For example, we can call the `.head()` method on the Batch to get the first few rows of our DataFrame.
7. The Batch object
We can also set the `fetch_all` parameter of the `batch.head()` method to True to view the entire DataFrame, along with its shape.
8. The Batch object
We can use the `.columns()` method of the Batch to get the DataFrame columns too. Note that this is a method in Great Expectations, as opposed to an attribute in pandas, so we need to include the parentheses.
9. Cheat sheet
We've covered a lot in this chapter. To summarize, the pipeline for connecting to data in Great Expectations involves three main steps: adding a pandas Data Source, creating a Data Asset that can connect to a pandas DataFrame, and building a Batch Definition using the Data Asset's `.add_batch_definition_whole_dataframe()` method. Batch Definitions establish Batches, which are collections of records from the Data Asset that are used to validate data. The Batch Definition's `.get_batch()` method allows us to access our Batch. The Batch's `.head()` and `.columns()` methods allow us to view our DataFrame rows and column names, respectively. Feel free to refer back to this overview as you work with Batch Definitions and Batches in the exercises.
10. Let's practice!
Now it's your turn to practice reading Batches of data into Great Expectations using your own Batch Definitions.