BigQuery
This document describes the Dataworkz connector configuration required to access BigQuery. Certain prerequisites must be met before the BigQuery connector can be configured.
Prerequisites
Service Account
Dataworkz requires an authorized Service Account to access BigQuery. Follow the Setup Service Account instructions to set one up. Ensure that the service account that has been created has access to BigQuery (refer to https://cloud.google.com/bigquery/docs/use-service-accounts for the permissions the service account must be granted to access BigQuery).
GCS Bucket
Dataworkz internally uses Apache Spark. When Spark writes data to BigQuery, the temporaryGcsBucket option specifies a Google Cloud Storage (GCS) bucket where Spark temporarily stages the data before loading it into BigQuery. The temporaryGcsBucket option is required when the data size exceeds a certain threshold, and it helps optimize the export process by using Cloud Storage as an intermediate storage location. The connector writes data to BigQuery by first buffering all of it in the temporary Cloud Storage bucket and then copying it into BigQuery in one operation. See https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#reading_and_writing_data_from_bigquery for details.
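For illustration, here is a minimal PySpark sketch of such a write, assuming a SparkSession with the spark-bigquery connector on the classpath; the project, dataset, table, and bucket names are placeholders:

```python
from pyspark.sql import SparkSession

# Placeholder names throughout; assumes the spark-bigquery connector
# is already available on the Spark classpath.
spark = SparkSession.builder.appName("bq-indirect-write").getOrCreate()

df = spark.read.format("bigquery").load("my_project.my_dataset.source_table")

# Indirect write: data is staged in the temporary GCS bucket and then
# loaded into BigQuery in one operation.
(df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("append")
    .save("my_project.my_dataset.target_table"))
```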
The BigQuery connector supports two write modes (writeMethod) when writing data into BigQuery:
Direct
Indirect
The above parameter is optional, and the default is Indirect.
Indirect (default)
You can specify the indirect write method with option("writeMethod", "indirect"). This method requires a temporary GCS bucket to be specified; if it is not, the write fails with an error.
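As a sketch (table and bucket names are placeholders), the two methods differ only in their options:

```python
# Indirect (default): data is staged in GCS first, so the bucket is mandatory;
# omitting temporaryGcsBucket here causes the write to fail.
(df.write.format("bigquery")
    .option("writeMethod", "indirect")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .save("my_project.my_dataset.target_table"))

# Direct: writes through the BigQuery Storage Write API; no staging bucket needed.
(df.write.format("bigquery")
    .option("writeMethod", "direct")
    .save("my_project.my_dataset.target_table"))
```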
The GCS bucket also needs to be configured, keeping the following in mind.
Bucket Permissions:
The GCS bucket should have the necessary permissions for writing and reading data. The user account or service account used by Spark/Dataworkz to write to the GCS bucket must have the roles/storage.objectCreator and roles/storage.objectViewer roles on the bucket.
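As an illustrative sketch (the bucket and service account names are assumptions), these roles can be granted with the google-cloud-storage client:

```python
from google.cloud import storage

BUCKET_NAME = "my-staging-bucket"  # placeholder bucket name
MEMBER = "serviceAccount:dataworkz-sa@my-project.iam.gserviceaccount.com"  # placeholder

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Fetch the current IAM policy and add both object roles for the service account.
policy = bucket.get_iam_policy(requested_policy_version=3)
for role in ("roles/storage.objectCreator", "roles/storage.objectViewer"):
    policy.bindings.append({"role": role, "members": {MEMBER}})
bucket.set_iam_policy(policy)
```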
Staging Data:
The GCS bucket is used to temporarily stage data before it is loaded into BigQuery. Make sure the bucket has sufficient storage space to accommodate the data being transferred.
Google Cloud Storage (GCS) Location:
The GCS bucket must be located in the same region as the BigQuery dataset to optimize data transfer performance. For example, if your BigQuery dataset resides in the 'us-south1' region, you must use a bucket in the 'us-south1' region.
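A small co-location check, sketched with the google-cloud-bigquery and google-cloud-storage clients and placeholder names:

```python
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

dataset = bq.get_dataset("my_project.my_dataset")  # placeholder dataset
bucket = gcs.get_bucket("my-staging-bucket")       # placeholder bucket

# Locations may differ in case ("us-south1" vs "US-SOUTH1"), so normalize.
if dataset.location.lower() != bucket.location.lower():
    raise ValueError(
        f"Region mismatch: dataset in {dataset.location}, "
        f"bucket in {bucket.location}"
    )
```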
Create Connector for BigQuery
Log in to the Dataworkz application
Go to Configuration -> Cloud Data Platforms -> BigQuery
Click the + icon to add a new configuration
Enter a name for the BigQuery configuration
Enter the Google Project ID
Enter the storage base path (for example, a gs:// URI pointing at the staging bucket)
Click "Choose File" to select the service account key (json file) and then click the upload button
Test the connection by clicking "Test Connection"
If successful, click Save
The newly created connector will show up in the list of BigQuery configurations
You can edit the configuration by clicking the edit icon under the Action column.
Click "Re-upload" to upload the modified key (JSON file). You can save once Test Connection is successful.