1. Loading
In this video, we'll cover the last step in the ETL process: load. At this point, we've extracted and transformed our data. It now makes sense to load the data into a database that's ready for analytics.
2. Analytics or applications databases
As we mentioned earlier, there's a clear separation between databases for analytics and databases for applications.
For example, analytical databases frequently run complex aggregate queries, so we should optimize them for that kind of workload.
On the other hand, application databases handle lots of transactions per second, so we should optimize them for that. This ties in with something we mentioned before: we often optimize application databases for online transaction processing, or OLTP.
Analytical databases, on the other hand, are optimized for online analytical processing, or OLAP.
3. Column- and row-oriented
At this point, it makes sense to come back to row- vs. column-oriented databases. As mentioned in the first video of this chapter, most application databases are row-oriented. That means we store data per record, which makes it easy to add new rows in small transactions. For example, in a row-oriented database, adding a customer record is easy and fast. In a column-oriented database, we store data per column. There are multiple reasons why this is optimal for analytics. Without getting too technical, you can think of analytical queries as being mostly about a small subset of columns in a table. By storing data per column, it's faster to loop over these specific columns to resolve a query. In a row-oriented system, we would lose time skipping unused columns for each row. Column-oriented databases also lend themselves better to parallelization.
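To make the difference concrete, here is a toy sketch in plain Python. This is not how a real database stores data on disk; it only illustrates why an aggregate over one column is cheaper in a columnar layout, and why inserting a record is simpler in a row layout.

```python
# Row-oriented layout: one record per entry.
# Adding a new customer is a single, cheap insert.
rows = [
    {"customer_id": 1, "name": "Alice", "spend": 120.0},
    {"customer_id": 2, "name": "Bob", "spend": 80.0},
]
rows.append({"customer_id": 3, "name": "Carol", "spend": 95.0})

# Column-oriented layout: one list per column.
# An aggregate over a single column only touches that list,
# without skipping over unused fields in every record.
columns = {
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "spend": [120.0, 80.0, 95.0],
}
average_spend = sum(columns["spend"]) / len(columns["spend"])
```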
4. MPP Databases
That brings us seamlessly to a type of database that is often the target at the end of an ETL process. They're called massively parallel processing databases. They're column-oriented databases, optimized for analytics, that run in a distributed fashion. Specifically, this means that queries are not executed on a single compute node, but rather split into subtasks and distributed among several nodes. Well-known managed examples are Amazon Redshift, Azure SQL Data Warehouse, and Google BigQuery.
5. An example: Redshift
Let's look at an example. To load data into Amazon Redshift, an excellent way to do this would be to write files to S3, AWS's file storage service, and send a copy query to Redshift. Typically, MPP databases load data best from files that use a columnar storage format. CSV files would not be a good option, for example. We often use a file format called parquet for this purpose. There are helper functions to write these kinds of files in several packages. For example, in pandas, you can use the `.to_parquet()` method on a dataframe. In PySpark, you can use `.write.parquet()`. You can then connect to Redshift using a PostgreSQL connection URI and copy the data from S3 into Redshift, like this.
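Here's a minimal sketch of that workflow. The DataFrame contents, S3 bucket, table name, cluster address, and IAM role are all placeholders, and it assumes the `sqlalchemy` and `s3fs` packages are installed.

```python
import pandas as pd
import sqlalchemy

# Result of the transform step (tiny example data).
recommendations = pd.DataFrame({
    "user_id": [1, 2],
    "course_id": [42, 7],
})

# Write the data to S3 in the columnar parquet format.
# pandas can write directly to s3:// paths when s3fs is installed.
recommendations.to_parquet("s3://my-data-bucket/recommendations.parquet")

# Connect to Redshift using a PostgreSQL connection URI.
engine = sqlalchemy.create_engine(
    "postgresql://repl:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dwh"
)

# Send a COPY query so Redshift loads the parquet file straight from S3.
copy_query = """
    COPY recommendations
    FROM 's3://my-data-bucket/recommendations.parquet'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;
"""
with engine.begin() as connection:
    connection.execute(sqlalchemy.text(copy_query))
```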
6. Load to PostgreSQL
In other cases, you might want to load the result of the transformation phase into a PostgreSQL database. For example, your data pipeline could extract from a rating table, transform it to find recommendations, and load them into a PostgreSQL database, ready to be used by a recommendation service. For this, there are also several helper methods in popular data science packages. For example, you could use `.to_sql()` in pandas. Often, you can also provide a strategy for when the table already exists. Valid strategies for `.to_sql()` in pandas are "fail", "replace", and "append".
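As a sketch, loading recommendations into PostgreSQL with `.to_sql()` could look like this; the connection URI, schema, and table name are placeholders.

```python
import pandas as pd
import sqlalchemy

# Recommendations produced by the transform step (example data).
recommendations = pd.DataFrame({
    "user_id": [1, 2, 3],
    "course_id": [96, 32, 61],
})

# Connect to the target PostgreSQL database (placeholder URI).
engine = sqlalchemy.create_engine(
    "postgresql://repl:password@localhost:5432/recommendations_db"
)

# Load the DataFrame into the "recommendations" table, replacing it
# if it already exists; "fail" and "append" are the other strategies.
recommendations.to_sql(
    "recommendations",
    engine,
    schema="public",
    if_exists="replace",
    index=False,
)
```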
7. Let's practice!
That's all for this video. Let's do some exercises.