Loading data to a SQL database with pandas
1. Loading data to a SQL database with pandas
Before we explore loading data to SQL databases, let's check in one last time on our architecture diagram.
2. Load data to a SQL database with pandas
In the last three lessons, we've extracted and transformed data from non-tabular data sources. Now, we're ready to tackle the last component of our ETL pipeline: loading data to a SQL database.
3. Loading data into a SQL database with pandas
SQL databases are one of the most common landing zones for data after transformation in a pipeline. It's easy to connect a visualization tool directly to a SQL database, and most data analysts and scientists are comfortable using SQL to query data. To load a DataFrame to a Postgres database, pandas provides the .to_sql method. This method is called on the DataFrame to be loaded, and takes the parameters name, con, if_exists, index, and index_label. Let's take a closer look, with the help of an example!
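For reference, here is a minimal sketch of the call shape. The DataFrame, table, and column names are purely illustrative, and an in-memory SQLite engine is used just so the snippet runs standalone; the lesson itself targets Postgres:

import pandas as pd
import sqlalchemy

# A toy DataFrame, and an in-memory SQLite engine so this snippet
# runs on its own (the example that follows uses Postgres)
df = pd.DataFrame({"price": [1.5, 2.0]}, index=["AAPL", "MSFT"])
engine = sqlalchemy.create_engine("sqlite:///:memory:")

df.to_sql(
    name="some_table",     # table the DataFrame is written to
    con=engine,            # engine (or connection) to the database
    if_exists="append",    # "append", "replace", or "fail" (the default)
    index=True,            # write the DataFrame index to the table
    index_label="ticker",  # SQL column name for that index
)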
4. Persisting data to Postgres with pandas
Similar to the read_sql function, a connection object needs to be created before data can be loaded to a SQL database. This can be done exactly as before, by creating a connection URI and engine with the help of sqlalchemy. The connection URI stores the database type, username, password, host, port, and database name, which is the "market" database in our example. In our stock data example, .to_sql is called on the clean_stock_data DataFrame. "filtered_stock_data" is passed to the name parameter; this is the name of the table in the market database that the DataFrame will be written to. The recently-created db_engine is passed to the con parameter, forming a connection with the market database. The value passed to if_exists determines how data is added to the table. Passing "append" adds the data to the existing table, while passing "replace" overwrites the existing data in the table with the current DataFrame. Here, data stored in the clean_stock_data DataFrame is appended to the existing "filtered_stock_data" table in the "market" database. The index parameter takes a boolean value: True if the index is to be loaded to the SQL table, and False otherwise. If index is set to True, the index_label parameter can be supplied. This is the name of the column in SQL that the pandas index will be written to. In our example, the index is written to Postgres with the column name "timestamps".
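Putting the pieces together, here is a sketch of the full pattern just described. It assumes a running Postgres server; the username, password, host, and port in the URI are placeholders, and clean_stock_data stands in for the transformed DataFrame from earlier in the pipeline:

import sqlalchemy

# Build a connection URI and engine for the "market" Postgres database
# (credentials, host, and port here are placeholders)
connection_uri = "postgresql+psycopg2://username:password@localhost:5432/market"
db_engine = sqlalchemy.create_engine(connection_uri)

# Append the transformed DataFrame to the "filtered_stock_data" table,
# writing the index to a column called "timestamps"
clean_stock_data.to_sql(
    name="filtered_stock_data",
    con=db_engine,
    if_exists="append",
    index=True,
    index_label="timestamps",
)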
5. Validating data persistence with pandas
Once data has been loaded to a Postgres database, it's important to validate that the data has been persisted as expected. If not, downstream consumers and processes will be limited by faulty data, and data users may even lose confidence in the data created by the pipeline. First and foremost, data should be queryable and return logical results. This includes making sure that counts match between the transformed data and the persisted data, and validating that each row is present in both DataFrames. With pandas, this is easy! Using the read_sql function from earlier, the data loaded to a SQL database can be queried for validation. The resulting DataFrame can then be compared with the transformed DataFrame to validate counts and record equality. Including these manual data quality checks is important not only when developing a data pipeline, but also in monitoring and alerting efforts. Providing this layer of validation helps instill trust in your data pipelines and ensures a robust solution has been implemented.
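As a sketch of what that validation might look like, assuming the db_engine and clean_stock_data objects from the previous slide, and pandas imported as pd:

# Read the persisted data back out of Postgres
to_validate = pd.read_sql("SELECT * FROM filtered_stock_data", db_engine)

# Restore the "timestamps" column as the index so the two DataFrames align
to_validate = to_validate.set_index("timestamps")

# Validate that row counts match between transformed and persisted data
assert to_validate.shape[0] == clean_stock_data.shape[0]

# Validate that each record matches
assert to_validate.equals(clean_stock_data)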
6. Let's practice!
Loading data to a SQL database makes it available to downstream consumers and processes for analytics and reporting use cases. Validating that this data has been persisted as expected helps instill trust in a data pipeline and ensures that a robust, complete solution has been developed. Time to practice with a few exercises!