1. Dask DataFrames
Dask also has a specialized subpackage for working with tabular datasets.
2. pandas DataFrames vs. Dask DataFrames
This is the dask-dot-dataframe subpackage. Just like Dask arrays mimic NumPy arrays, Dask DataFrames mimic pandas DataFrames.
Here, we read one CSV file into memory using pandas. But this CSV is only one chunk of a larger dataset, which is too big to store in a single CSV file or to fit in memory. Using Dask, we can lazily load all of the CSVs in the folder into a single Dask DataFrame.
3. Dask DataFrames
When the Dask DataFrame is created, Dask takes a quick look at one of the CSV files to retrieve the name and data type of each column. But it doesn't load any data.
Here, the number of partitions is 3. This means the dataset is split into 3 chunks, since there are 3 CSV files.
4. Dask DataFrame task graph
When we visualize the task graph for the DataFrame, we can see this clearly. Dask will read from CSV 3 times.
5. Controlling the size of blocks
Sometimes, we will want chunks which are smaller than each CSV file. When using multi-threading or parallel processing, Dask will have several different chunks in memory at once, and if the chunks are too big, we will run out of RAM.
We can limit the memory size of each chunk by setting the blocksize parameter. This example will limit each chunk to ten megabytes. The number of partitions, which means the number of chunks, has gone up from 3 to 7.
6. Explaining partitions
This is because of the file sizes of the CSVs.
7. Explaining partitions
The first file is smaller than the blocksize, so it forms one partition. The second file is split into two, and the third file is split into four.
8. Analysis with Dask DataFrames
Most of the commands which we can use with pandas DataFrames also work on Dask DataFrames. We can select columns, and assign new columns or overwrite old ones. We can run mathematical methods on them, perform groupbys, and use methods like the nlargest method we used in Chapter 1. This particular example finds the top 3 rows sorted by col1.
All of these will return lazy Dask DataFrames or Dask Series. A Series is just a column of a DataFrame.
9. Datetimes and other pandas functionality
Many functions from pandas also have Dask equivalents, such as this example of converting a column of strings to datetimes. Even methods like the datetime accessor, which allows us to select parts of the datetime, work in Dask. These also return lazy results.
10. Making results non-lazy
We can create a preview of a DataFrame using its dot-head method, which returns a 5-row pandas DataFrame. To generate this, Dask will load the data and perform whatever computations are needed to produce just these 5 rows.
To compute the results for the full DataFrame, we can use its compute method, which also returns a pandas DataFrame.
11. Sending answer straight to file
Sometimes, the DataFrame may still be too big to fit in memory after your analysis. When this happens, we can use Dask to save each partition to disk instead of ever holding the full result in memory at once.
Here, we save the DataFrame straight into some CSV files. Using a string with an asterisk means Dask will insert the partition number into the filename.
12. Faster file formats - Parquet
However, if we really want to speed up our DataFrame analysis, we should be using a modern file format like Parquet, which can make reading and writing data much faster. We can read a Parquet folder using Dask's read_parquet function, and write to Parquet using a DataFrame's to_parquet method.
13. Let's practice!
Alright, let's go to the exercises.