1. Persisting data with pandas
Nice transformation! In this chapter, we've explored the basics of extracting and transforming data. Now, it's time to learn about loading data using pandas.
2. Persisting data in an ETL pipeline
Loading data to a file is a great way to ensure that data consumers, including data scientists and analysts, have stable access to extracted and transformed data.
While typically most visible during the "load" portion of an ETL process, data persistence is a best practice that can, and should, happen at multiple stages in a data pipeline. Persisting data to a file allows for a "snapshot" to be taken at various points throughout the pipeline. This is especially useful when recovering from a failure, and is essential if data is hard to reacquire from a source system.
3. Loading data to CSV files using pandas
pandas makes it easy to write a DataFrame to a file, using the to_csv method.
In our example, our raw stock market data is loaded into a DataFrame, and a basic transformation is applied. Then, the to_csv method is called on the DataFrame, with the desired file path passed as an argument. The DataFrame is then written to the relative path stock_data.csv.
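A minimal sketch of what this could look like (the raw_stock_data.csv source file and the "open" column used in the transformation are hypothetical):

    import pandas as pd

    # Extract: read the raw stock market data into a DataFrame
    raw_stock_data = pd.read_csv("raw_stock_data.csv")

    # Transform: a basic example transformation, keeping rows with a positive "open" price
    cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, :]

    # Load: write the DataFrame to the relative path stock_data.csv
    cleaned.to_csv("stock_data.csv")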
In addition to the file path, there are a few more arguments that customize how a DataFrame is to be stored.
4. Customizing CSV file output
The first is the header argument. This can take the value True or False, or a list of strings to alias the column names. By default, this argument is True.
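Continuing the sketch above, with illustrative aliases that assume the DataFrame has exactly two columns:

    # Replace the column names in the output file with custom aliases
    cleaned.to_csv("stock_data.csv", header=["opening_price", "closing_price"])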
Another argument that can be passed to the to_csv method is the index argument. This too can take the value True or False, defaulting to True. When True, the index column will be written to the file. Typically, this is useful if an index column is meaningful to the dataset, such as when it stores a unique transaction ID. Otherwise, it's best practice to set index to False.
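For example, to exclude a meaningless default index from the output file:

    # Omit the index column when writing the file
    cleaned.to_csv("stock_data.csv", index=False)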
The final argument that we'll dig into is the sep argument. sep takes a string, which is used to separate columns in the CSV file. The default value is a comma, but a common alternative is the pipe character.
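For instance, to write a pipe-delimited file (the .txt extension here is just a convention for non-comma-delimited output):

    # Separate columns with a pipe character instead of a comma
    cleaned.to_csv("stock_data.txt", sep="|")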
The to_csv method has counterparts that load data to different storage media, such as the to_parquet and to_json methods. Later in the course, we'll explore the to_sql method, which writes a DataFrame to a SQL table. In addition to these three methods, there are several others that can be used to load a DataFrame to a file or database.
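As a quick sketch, again reusing the DataFrame from earlier (the Parquet example assumes a Parquet engine such as pyarrow is installed):

    # Persist the same DataFrame to other storage formats
    cleaned.to_parquet("stock_data.parquet")  # columnar Parquet format
    cleaned.to_json("stock_data.json")        # JSON format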
5. Ensuring data persistence
Once we've written a DataFrame to a file, our next thought might be, "how do we know that worked?". A great way to validate this is to check the file path and ensure that the file has been stored there.
This can be done with the help of the os module.
Here, we use the os.path.exists function to check if a file exists at the path that's passed to the function. In our example, we store the output of this function in a variable so we can inspect its value.
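A minimal sketch, reusing the stock_data.csv path from earlier:

    import os

    # Check whether a file exists at the path the DataFrame was written to
    file_exists = os.path.exists("stock_data.csv")
    print(file_exists)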
If the value of the variable is True, then the file path exists. If False, an error may have occurred along the way, causing the DataFrame to not be persisted.
This is a great tool to use when validating the "load" step in a data pipeline.
6. Let's practice!
Excellent! We've been exposed to using pandas to persist data to be used by downstream consumers and processes. Now, it's time to practice!