
BigQuery


This document describes the Dataworkz connector configuration required to access BigQuery. There are certain prerequisites that must be met before the BigQuery connector can be configured.

Prerequisites

Service Account

Dataworkz requires a Service Account that is authorized to access BigQuery. Follow Google's instructions to set up a Service Account (see the Setup Service Account link at the end of this page) and ensure that the service account you create has been granted the permissions it needs to access BigQuery.
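
Before uploading the key to Dataworkz, you can optionally verify that the service account has BigQuery access. The following is a minimal sketch using the google-cloud-bigquery Python client; the key file path is a placeholder and this client library is not part of Dataworkz itself.

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Placeholder path to the service account key (JSON) downloaded from Google Cloud.
    creds = service_account.Credentials.from_service_account_file("key.json")
    client = bigquery.Client(credentials=creds, project=creds.project_id)

    # Listing datasets succeeds only if the service account can reach BigQuery.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)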

GCS Bucket

Dataworkz internally uses Apache Spark. When Spark writes data to BigQuery, the temporaryGcsBucket option specifies a Google Cloud Storage (GCS) bucket where Spark temporarily stages data before loading it into BigQuery. The temporaryGcsBucket option is required when the data size exceeds a certain threshold, and it optimizes the export process by using Cloud Storage as an intermediate storage location. The connector writes the data to BigQuery by first buffering all of it into a temporary table in Cloud Storage, then copying all of the data from Cloud Storage into BigQuery in one operation. See the Spark BigQuery connector example linked at the end of this page for details.

The BigQuery connector supports two write modes (writeMethod) when writing data into BigQuery:

  • Direct

  • Indirect

The above parameter is optional, and the default is Indirect.

Indirect (default)

You can select the indirect mode with option("writeMethod", "indirect"). This mode requires you to specify a temporary GCS bucket; if one is not provided, the write fails with an error.
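
For reference, a minimal PySpark sketch of an indirect write looks like the following. This is not Dataworkz-specific configuration: it assumes the spark-bigquery connector is available to the Spark session, and the bucket, project, dataset, and table names are placeholders.

    from pyspark.sql import SparkSession

    # Assumes the spark-bigquery connector is on the session's classpath.
    spark = SparkSession.builder.appName("bq-indirect-write").getOrCreate()

    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    (df.write.format("bigquery")
        .option("writeMethod", "indirect")                  # the default write mode
        .option("temporaryGcsBucket", "my-staging-bucket")  # required for indirect writes
        .option("table", "my-project.my_dataset.my_table")  # placeholder table reference
        .mode("append")
        .save())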

The GCS bucket also needs to be configured with the following in mind.

Bucket Permissions:

The GCS bucket should have the necessary permissions for writing and reading data. The user account or service account used by Spark/Dataworkz to write to the GCS bucket must have the roles/storage.objectCreator and roles/storage.objectViewer roles on the bucket.
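
If you manage bucket IAM in code rather than in the console, a sketch with the google-cloud-storage Python client might look like this; the bucket name and service account email are placeholders.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-staging-bucket")  # placeholder bucket name
    member = "serviceAccount:dataworkz-sa@my-project.iam.gserviceaccount.com"  # placeholder

    # Fetch the bucket's IAM policy and grant both roles to the service account.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    for role in ("roles/storage.objectCreator", "roles/storage.objectViewer"):
        policy.bindings.append({"role": role, "members": {member}})
    bucket.set_iam_policy(policy)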

Staging Data:

The GCS bucket is used to temporarily stage data before it is loaded into BigQuery. Make sure the bucket has sufficient storage space to accommodate the data being transferred.

Google Cloud Storage (GCS) Location:

The GCS bucket must be located in the same region as BigQuery to optimize data transfer performance. For example, if BigQuery is running in the 'us-south1' region, you must use a bucket in the 'us-south1' region.

Create Connector for BigQuery

  1. Log in to the Dataworkz application

  2. Go to Configuration -> Cloud Data Platforms -> BigQuery

  3. Click the + icon to add a new configuration

  4. Enter a name for the storage configuration

  5. Enter the Google Project ID

  6. Enter the storage base path

  7. Click "Choose File" to select the service account key (JSON file) and then click the upload button

  8. Test the connection by clicking "Test Connection"

  9. If successful, click Save

The newly created connector will show up in the list of BigQuery configurations.

You can edit the configuration by clicking the edit icon under the Action column.

Click "Re-upload" to upload a modified key (JSON file). You can save once Test Connection is successful.

Setup Service Account: https://cloud.google.com/bigquery/docs/use-service-accounts

Spark BigQuery connector example: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#reading_and_writing_data_from_bigquery