Get startedGet started for free

Dataform

1. Dataform

Dataform is a serverless framework that simplifies the development and management of ELT pipelines using SQL, Dataform enables data transformation within BigQuery, ensuring data quality and providing documentation, This approach streamlines the process of moving data from source systems to production tables in BigQuery, making operations more efficient and manageable, Dataform streamlines data operations in BigQuery by unifying transformation, assertion, and automation, Without Dataform, tasks like defining tables, managing code, testing data quality, and scheduling pipelines would be time-consuming and prone to errors, They could also involve multiple tools and manual processes, Dataform simplifies these tasks within BigQuery, improving efficiency and data reliability, Dataform and BigQuery work together to manage SQL workflows, With Dataform, developers create and compile SQL workflows using SQL and JavaScript, Dataform then performs real-time compilation, including dependency checks and error handling, Finally, the compiled SQL workflows are executed within BigQuery, enabling SQL transformations and materialization either on-demand or through scheduled runs, Development using Dataform utilizes workspaces containing default files and folders, Key folders include 'definitions' for sqlx files and 'includes' for JavaScript files, The ,gitignore file is used for managing Git commits, Developers may also use package,json and package-lock,json for handling JavaScript dependencies, The 'workflow settings,yaml' file stores, project compilation settings, and custom files like README,md can also be added, The sqlx file structure provides a clear framework for organizing SQL code and associated tasks, It begins with a config block for metadata and data quality tests, followed by a js block to define reusable JavaScript functions. The pre_operations block handles SQL statements executed before the main SQL body, which defines the core SQL logic. Finally, the post_operations block contains SQL statements to be run after the main execution, ensuring a structured and efficient workflow. sqlx development streamlines SQL code by replacing repetitive patterns with concise definitions. The code example demonstrates how a complex CASE statement for categorizing countries can be replaced with a simple function call $(mapping.region("country")). This approach improves code readability and maintainability by reducing boilerplate code and promoting reusability. With Dataform, table and view definitions should be created in a specific manner so that they can be compiled into SQL statements. Key configuration types are: declaration for referencing existing BigQuery tables. table for creating or replacing tables with a SELECT statement, incremental for creating tables and updating them with new data, and view for creating or replacing views, which can optionally be materialized. Dataform offers assertions to define data quality tests, ensuring data consistency and accuracy. Assertions can be written in SQL or JavaScript, providing flexibility for complex checks. Operations allow you to run custom SQL statements before, after, or during pipeline execution. These two options enable custom data transformations, data quality checks, and other tasks within your workflows. By combining assertions and operations, Dataform empowers you to create robust and reliable data pipelines in BigQuery. Dataform provides two methods to manage dependencies, implicit declaration and explicit declaration. Implicit declaration is when you reference tables or views directly within your SQL using the ref() function. Explicit declaration is when you list dependencies within a config block using the dependencies array. It is also possible to use the resolve() function to reference without creating a dependency. Dataform allows you to compile user-defined table definitions into executable SQL scripts. The sample code shows a customer_details table being created or replaced based on a customer_source table using a SELECT statement. Dataform manages the dependencies between these tables and orchestrates their execution within a workflow. This process streamlines data transformation and ensures efficient data pipeline management. Dataform SQL workflows are best visualized in graph format. The sample workflow starts with a declaration of customer_source followed by a customer_intermediate table, likely derived from a source system as a pre-processed data source. Next, customer_rowConsistency applies assertions for data quality checks. The graph then splits into two paths. In one path, an operation named customer_ml_training is invoked. It performs operations on the validated data. In the other path, a view named customer_prod_view is created. There are several scheduling and execution mechanisms for Dataform SQL workflows. One path is through internal triggers. These include manual execution in the Dataform UI or scheduled configurations within Dataform itself. The other path is through external triggers. These include tools like Cloud Scheduler and Cloud Composer. Ultimately, all workflows are executed within BigQuery, showcasing its central role in this process.

2. Let's practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.