
Data Intelligence Platform - Data

1. Data Intelligence Platform - Data

Let's see how data works in the Data Intelligence Platform. By the end, you'll understand Databricks' recommendations for storing, managing, and governing data.

2. Why do organizations care about data management?

Why do organizations care about data management? Although it may seem dry and technical, data management is crucial for any organization that holds data. The main reason is security: organizational data offers a competitive edge and may include sensitive customer information, so it needs strong protection. Effective data management also improves data analytics. When data is well managed, teams have greater confidence in its accuracy and can derive meaningful insights for the business.

3. Kinds of data

There are three main kinds of data that you will encounter. Different systems support different data formats and types, but every kind contains valuable information. Structured data is the kind we are most familiar with: it has a consistent row-and-column structure. This is the data we know from databases, and it often comes in the form of a CSV file. Here is an example dataset of different people and their occupations.
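As a minimal sketch of what "row and column structure" means in practice, the snippet below parses a small CSV with Python's standard library. The names, ages, and occupations are made up for illustration and are not the dataset shown on the slide.

```python
import csv
import io

# Hypothetical sample data; every row has the same three columns.
CSV_TEXT = """name,age,occupation
Ana,34,Engineer
Ben,45,Teacher
Chloe,29,Nurse
"""

# DictReader uses the header row as column names for every following row.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

for row in rows:
    print(row["name"], row["occupation"])
```

Because every row shares the same columns, any field can be looked up by column name without checking whether it exists first.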

4. Kinds of data

Semi-structured data is fairly common, especially in web-based applications. This data follows patterns of keys and values, so it has some structure but is more flexible in content. Common semi-structured formats are JSON and XML. Here, we have the same occupation data from the structured dataset, but in JSON format.
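To illustrate the flexibility of key-value data, here is a small sketch using Python's standard `json` module. The records are hypothetical; note that each one can carry different keys, which a fixed row-and-column layout would not allow.

```python
import json

# Hypothetical records: same general pattern, but keys vary per record.
JSON_TEXT = """
[
  {"name": "Ana", "occupation": {"title": "Engineer", "skills": ["Python", "SQL"]}},
  {"name": "Ben", "occupation": {"title": "Teacher"}},
  {"name": "Chloe", "occupation": {"title": "Nurse"}, "certified": true}
]
"""

records = json.loads(JSON_TEXT)

# A key present in every record can be read directly.
titles = [r["occupation"]["title"] for r in records]

# Optional keys must be read defensively, with a default for missing values.
skills = [r["occupation"].get("skills", []) for r in records]

print(titles)
print(skills)
```

The defensive `.get(...)` call is the practical cost of the extra flexibility: the reader cannot assume every record has every field.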

5. Kinds of data

Unstructured data is increasingly common due to the rise of edge devices, such as cameras and mobile devices. Common formats here include PNG and MP4 files. These files hold a lot of information, but that information is harder to extract.

6. Delta

With Databricks, we recommend storing data in the Delta format. Delta is an open-source project that brings data warehouse performance to the lakehouse. Under the hood, a Delta table is a collection of Parquet files with additional metadata and a JSON transaction log. This means that your data will function like a database table while remaining in an open file format in the lakehouse. Your data will be fully ACID compliant, and you can handle batch and streaming datasets in one place.
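The idea of a transaction log sitting alongside data files can be sketched with a toy model. This is not the Delta implementation: the commits below are simplified, hypothetical entries that keep only "add" and "remove" actions with a file path, whereas a real log records much more (for example schema metadata and file statistics). The sketch replays the log to work out which Parquet files make up the table at a given version.

```python
import json

# Simplified, hypothetical commits: one JSON action per line, one string per commit.
COMMITS = [
    '{"add": {"path": "part-000.parquet"}}\n{"add": {"path": "part-001.parquet"}}',
    '{"add": {"path": "part-002.parquet"}}',
    '{"remove": {"path": "part-000.parquet"}}\n{"add": {"path": "part-003.parquet"}}',
]


def live_files(commits, version=None):
    """Replay commits 0..version (inclusive) and return the files in the table."""
    last = len(commits) - 1 if version is None else version
    files = set()
    for commit in commits[: last + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


print(live_files(COMMITS))             # files in the latest version
print(live_files(COMMITS, version=0))  # files as of the first commit
```

In this toy model, a commit either appears in the log in full or not at all, which is the intuition behind table-like, transactional behavior on top of plain files.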

7. Unity Catalog

Once data has been stored, we need to manage how that data is accessed and governed. In Databricks, Unity Catalog provides a holistic governance layer on top of everything in the lakehouse. It provides granular access control to every data asset in the lakehouse, from data tables to machine learning models.

8. Unity Catalog

Unity Catalog is easy to use and leverages common SQL statements such as GRANT and REVOKE. You can control who has access to each data asset and what kind of access that person has.
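As a rough sketch of what these statements look like, the helpers below build GRANT and REVOKE strings in Python. The table name `main.hr.occupations` and the group `analysts` are hypothetical, and the helper functions are illustrative rather than part of any Databricks API. In a Databricks notebook, the resulting strings could be passed to `spark.sql(...)`, assuming a `spark` session is available.

```python
def grant(privilege, table, principal):
    """Build a GRANT statement giving `principal` a privilege on a table."""
    return f"GRANT {privilege} ON TABLE {table} TO `{principal}`"


def revoke(privilege, table, principal):
    """Build a REVOKE statement removing a privilege from `principal`."""
    return f"REVOKE {privilege} ON TABLE {table} FROM `{principal}`"


# Hypothetical three-level name (catalog.schema.table) and group name.
TABLE = "main.hr.occupations"

grant_stmt = grant("SELECT", TABLE, "analysts")
revoke_stmt = revoke("SELECT", TABLE, "analysts")

print(grant_stmt)
print(revoke_stmt)
# In a notebook, each statement could then be run with spark.sql(grant_stmt).
```

These helpers do no validation of their inputs, so they are only suitable for trusted, hard-coded names like the ones in this example.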

9. Catalog Explorer

The Catalog Explorer brings all of Databricks' data management concepts together in a single location where you can explore all of your data assets. In this UI, you can discover information about the data, manage your Unity Catalog permissions, and view all related assets. This is a central feature of Databricks and is your one-stop shop for all things data management.

10. Let's practice!

Now, let us review and practice data management in our learning environment.