What's the difference between data warehouses and data lakes?
1. What's the difference between data warehouses and data lakes?
Welcome back. This video clarifies the differences between a data warehouse, a data mart, and a data lake. Their similar-sounding names that can be confusing.2. Database
Most people are familiar with databases so we will start with them. Databases use tables to store information in a structured way with rows and columns. Organizations use databases to store the different transactions that happen within the organization. We will call them transactional databases. For example, when a company sells a good to a customer, the information about that transaction is stored in a database. Now let's look at a data warehouse.3. Data warehouse
Data warehouses gather data from different areas of an organization, integrate it, and make it available for analysis. They are built as a central data store for the entire organization, representing many departments. Therefore, there are many data sources as input to the data warehouse, including multiple databases or even non-databases such as a log file. All this data is collected, transformed if needed, and integrated into a structured format into the data warehouse in an ETL process or extract, transform, and load. We will talk more about ETL processes in a later video, but it is important to note that data in the data warehouse is structured into tables with rows and columns. This structure can make it complex to change because of upstream and downstream effects. Also, data warehouses are large, typically larger than 100 GB. You might think, why not query the different transactional databases when performing an analysis versus using a data warehouse?4. Why the data warehouse?
Running queries involving a large amount of data could take a very long time, possibly slowing down the database and restricting it from its primary purpose of recording transactions. Now that we have discussed data warehouses let's discuss data marts.5. Data marts
A data mart is a relational database that stores an organization's transactional data for analysis. Data marts and data warehouses both hold structured data. However, whereas a data warehouse contains data from many different departments, a data mart only focuses on one department, such as just Finance. Data marts have only a few input data sources versus a data warehouse with many. Often the input source is a subset of data from a data warehouse. Additionally, a data mart is typically less than 100 GB, which is smaller than most data warehouses. So, let's extend our understanding now and discuss data lakes.6. Data lake
Data lakes, similar to data warehouses, are built as a central store of data for the entire organization for analysis. Therefore, they store data for many different departments, have many input data sources, and are large. However, data lakes can hold non-structured data, such as videos or audio files. A Data Scientist might use this unstructured data in a model that analyzes video of a production process for quality control.7. Data lake
Compared to data warehouses, it is easier to make changes to data lakes because of their flexibility in storing unstructured data. This flexibility also allows storing data whose purpose may not be known today but may be helpful for future analysis. In contrast, when designing table structures for data warehouses and marts, organizations tend to know how they want to use the data for analysis.8. Summary
In summation, data lakes hold both structured and unstructured data, while the others have only structured. Data lakes are easier to change but may contain data with an unknown purpose. Data marts only hold data for one department with few sources and are smaller in size. The alternatives have data from many departments and sources and are significantly larger.9. Let's practice!
Time to practice!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.