1. Data ingestion in BigQuery
Welcome back! Now that we know how to query data, we should also understand how to load data into BigQuery. Let’s take a look at a few ways to do so.
2. Methods of loading data
There are three main methods for batch loading data into BigQuery: the BigQuery Studio workspace, the bq command-line tool, or the LOAD DATA command in SQL. Each path has advantages and disadvantages, so understanding these tools will help us make the right decision for each use case. Keep in mind that this list does not include streaming ingestion or the many external tools available; in this course, we cover data ingestion conceptually.
3. Using the BigQuery Studio
First is loading data via the BigQuery Studio user interface in the Google Cloud console. This allows us to select a file from our computer or from Google Cloud Storage and import it into BigQuery. When we load the table, we must ensure that the file type is Avro, CSV, JSON, ORC, or Parquet.
4. Using the BigQuery Studio
This file-type requirement applies to all methods of batch loading. Then, we provide the target project, dataset, and table name for the data.
5. Using the BigQuery Studio
If we want to provide a schema for the table, we can do so using JSON or the user interface by adding an entry for each field. Otherwise, BigQuery will attempt to auto-detect the schema.
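As a minimal sketch, a JSON schema file lists one object per column with its name, type, and mode; the field names below are hypothetical placeholders:

```json
[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "signup_date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "email", "type": "STRING", "mode": "NULLABLE"}
]
```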
6. Using the bq command line tools
BigQuery also has a command-line tool, bq, that allows us to ingest data files, run queries, manage data, and more. We can import files from our local computer or those stored in Google Cloud Storage. Note that you do not need to specify a project, as the command-line tool will prompt you to set a default project. After the target dataset and table, we provide the URI of our Google Cloud Storage object. We can also use many other optional flags with these commands to have fine-grained control over our import process. As we can see here, the format flag declares a CSV file, and the "autodetect" flag tells BigQuery to detect the schema automatically.
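A typical invocation, assuming a hypothetical dataset, table, and Cloud Storage bucket, might look like this:

```shell
# Load a CSV from Google Cloud Storage into mydataset.mytable,
# letting BigQuery infer the schema from the file contents.
bq load \
  --source_format=CSV \
  --autodetect \
  mydataset.mytable \
  gs://mybucket/data.csv
```

The same command accepts a local file path in place of the gs:// URI when loading from your computer.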
7. Using LOAD DATA in SQL
The final option we will cover is the LOAD DATA SQL command, which loads data stored in Google Cloud Storage into tables in BigQuery. As we can see here, we are loading a CSV into a table named dataset.table. Next, we use the FROM FILES clause and, in parentheses, provide the arguments. First are the URIs, which must point to objects in Google Cloud Storage. As we can see, the files are added in an array, so we can load more than one file as long as they share the same schema. Then, we provide two other arguments: the file format and the CSV-specific option that tells BigQuery to skip the first row of data. This command accepts many other arguments for specific data formats and other use cases.
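Putting those pieces together, a statement along these lines (the dataset, table, and bucket names are placeholders) would perform the load:

```sql
LOAD DATA INTO mydataset.mytable
FROM FILES (
  uris = ['gs://mybucket/data.csv'],
  format = 'CSV',
  skip_leading_rows = 1  -- skip the header row
);
```

Additional URIs can be appended to the array to load multiple files with the same schema in one statement.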
8. Data ingestion considerations
The main consideration for loading data is whether we want to load our data from a local file. If so, we cannot use the LOAD DATA command, and our data must be under 100 MB for the other two methods. The bq command-line and BigQuery Studio methods work the same way and are subject to some other limitations, which can be found at the link in the citations.
9. Let's practice!
Now that you understand the core methods for ingesting batch data into BigQuery, let's put these methods into practice.