Data Sources Versus Data Sinks

1. Data Sources Versus Data Sinks

The ingest stage of a data pipeline is the point where data enters the pipeline as a data source and becomes available for use downstream. Think of a data source as the starting point of your data journey. It is raw, unprocessed data waiting to be transformed into valuable insights. Any system, application, or platform that creates, stores, or shares data can be considered a data source. Two examples of Google Cloud products used in the ingest stage are Cloud Storage, a data lake holding various types of data sources, and Pub/Sub, an asynchronous messaging system delivering data from external systems.

The transform stage of a data pipeline represents any action taken to adjust, modify, join, or customize a data source so that it matches a specific downstream data or reporting requirement. There are three main transformation patterns: extract and load (EL); extract, load, and transform (ELT); and extract, transform, and load (ETL). You will explore each of these patterns in its own module later in the course.

The store stage of a data pipeline represents the last step, when we deposit data in its final form. A data sink is the final stop in the data journey. It's where processed and transformed data is stored for future use, analysis, and decision-making. Think of it as the reservoir at the end of the river, where valuable information is collected and readily available. Two examples of Google Cloud products used in the store stage are BigQuery, a serverless data warehouse, and Bigtable, a highly scalable NoSQL database.
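To make the three stages concrete, here is a minimal Python sketch of a pipeline that reads from a data source, transforms each record, and writes the result to a data sink. Everything in it is an assumption made for illustration: `InMemorySource` and `InMemorySink` are hypothetical in-memory stand-ins (not Cloud Storage, Pub/Sub, BigQuery, or Bigtable clients), and the `user` and `amount` fields are made-up sample data.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySource:
    """Hypothetical data source: hands out raw, unprocessed records."""
    records: list[dict]

    def read(self):
        yield from self.records


@dataclass
class InMemorySink:
    """Hypothetical data sink: keeps processed rows for later analysis."""
    rows: list[dict] = field(default_factory=list)

    def write(self, row: dict) -> None:
        self.rows.append(row)


def transform(record: dict) -> dict:
    """Reshape one raw record to match a made-up downstream requirement."""
    return {
        "user": record["user"].strip().lower(),
        "amount_usd": round(float(record["amount"]), 2),
    }


def run_pipeline(source: InMemorySource, sink: InMemorySink) -> None:
    """Ingest from the source, transform each record, store it in the sink."""
    for record in source.read():
        sink.write(transform(record))


# Made-up sample data standing in for raw input.
source = InMemorySource([
    {"user": " Alice ", "amount": "12.50"},
    {"user": "BOB", "amount": "3"},
])
sink = InMemorySink()
run_pipeline(source, sink)
print(sink.rows)
```

In a real pipeline the source and sink would be separate services, and the transform step might run before or after loading depending on which of the three patterns is used. The sketch only shows the direction of flow: data leaves a source raw and arrives in a sink in its final form.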

2. Let's practice!

This exercise is part of the course

Introduction to Data Engineering on Google Cloud

Skill level: Beginner

This section welcomes you to the Introduction to Data Engineering on Google Cloud course, and provides an overview of the course structure and goals.

Exercise 1: Course Introduction

This module provides an introduction to the role of a data engineer. It covers key concepts such as data sources and sinks, data formats, storage options on Google Cloud, metadata management, and the use of Analytics Hub for data sharing within and outside an organization.

Exercise 1: Module Introduction Exercise 2: The Role of a Data Engineer Exercise 3: Data Sources Versus Data Sinks

Exercise 4: Data Formats Exercise 5: Storage Solution Options on Google Cloud Exercise 6: Metadata Management Options on Google Cloud Exercise 7: Sharing Datasets using Analytics Hub Exercise 8: Lab Intro: Loading Data into BigQuery Exercise 9: Loading Data into BigQuery Exercise 10: Quiz Question 1 Exercise 11: Quiz Question 2 Exercise 12: Quiz Question 3 Exercise 13: Quiz Question 4 Exercise 14: Quiz Question 5

This module provides an overview of data replication and migration on Google Cloud. It covers the basic architecture, the 'gcloud' command-line tool, Storage Transfer Service, Transfer Appliance, and Datastream, along with their functionalities and use cases.

Exercise 1: Module Introduction Exercise 2: Replication and Migration Architecture Exercise 3: The gcloud Command Line Tool Exercise 4: Moving Datasets Exercise 5: Datastream Exercise 6: Lab Intro: Datastream: PostgreSQL Replication to BigQuery Exercise 7: Datastream: PostgreSQL Replication to BigQuery Exercise 8: Quiz Question 1 Exercise 9: Quiz Question 2 Exercise 10: Quiz Question 3 Exercise 11: Quiz Question 4 Exercise 12: Quiz Question 5

This module focuses on data extraction and loading processes on Google Cloud, particularly with BigQuery. It covers the basic extraction and loading architecture, the bq command-line tool, BigQuery Data Transfer Service, and BigLake as an alternative to traditional extract-load patterns.

Exercise 1: Module Introduction Exercise 2: Extract and Load Architecture Exercise 3: The bq Command Line Tool Exercise 4: BigQuery Data Transfer Service Exercise 5: BigLake Exercise 6: Lab Intro: BigLake: Qwik Start Exercise 7: Lakehouse: Qwik Start Exercise 8: Quiz Question 1 Exercise 9: Quiz Question 2 Exercise 10: Quiz Question 3 Exercise 11: Quiz Question 4 Exercise 12: Quiz Question 5

This module provides an overview of ELT (extract, load, transform) processes on Google Cloud. It covers the basic ELT architecture, a common ELT pipeline example, BigQuery's capabilities for scripting and scheduling SQL, and the functionality and use cases of Dataform.

Exercise 1: Module Introduction Exercise 2: Extract, Load, and Transform (ELT) Architecture Exercise 3: SQL Scripting and Scheduling with BigQuery Exercise 4: Dataform Exercise 5: Lab Intro: Create and Execute a SQL Workflow in Dataform Exercise 6: Create and execute a SQL workflow in Dataform Exercise 7: Quiz Question 1 Exercise 8: Quiz Question 2 Exercise 9: Quiz Question 3 Exercise 10: Quiz Question 4 Exercise 11: Quiz Question 5

This module provides an overview of ETL (extract, transform, load) processes on Google Cloud. It covers the basic ETL architecture, GUI tools, batch and streaming data processing options (Dataproc, Dataproc Serverless), and the role of Bigtable in data pipelines.

Exercise 1: Module Introduction Exercise 2: Extract, Transform, and Load (ETL) Architecture Exercise 3: Google Cloud GUI Tools for ETL Data Pipelines Exercise 4: Batch Data Processing Using Dataproc Exercise 5: Lab Intro: Use Serverless for Apache Spark to Load BigQuery Exercise 6: Use Serverless for Apache Spark to Load BigQuery Exercise 7: Streaming Data Processing Options Exercise 8: Bigtable and Data Pipelines Exercise 9: Lab Intro: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow Exercise 10: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow Exercise 11: Quiz Question 1 Exercise 12: Quiz Question 2 Exercise 13: Quiz Question 3 Exercise 14: Quiz Question 4 Exercise 15: Quiz Question 5

This module focuses on automation patterns and options for pipelines on Google Cloud. It covers various tools and services like Cloud Scheduler, Workflows, Cloud Composer, Cloud Run functions, and Eventarc, along with their functionalities and use cases for automation.

Exercise 1: Module Introduction Exercise 2: Automation Patterns and Options for Pipelines Exercise 3: Cloud Scheduler and Workflows Exercise 4: Cloud Composer Exercise 5: Cloud Run Functions Exercise 6: Eventarc Exercise 7: Lab Intro: Use Cloud Run Functions to Load BigQuery Exercise 8: Use Cloud Run Functions to Load BigQuery Exercise 9: Quiz Question 1 Exercise 10: Quiz Question 2 Exercise 11: Quiz Question 3 Exercise 12: Quiz Question 4 Exercise 13: Quiz Question 5

In this final section, we review what was presented in this course and discuss the next steps to continue your cloud learning journey.

Exercise 1: Course Summary Exercise 2: Course Resources