
Working with CSV files

Introduction

This document describes how to use Dataworkz to process CSV files stored in AWS S3. The same guidance applies to CSV files stored in any cloud object store, including AWS S3, Azure Blob Storage and Google Cloud Storage.

In addition to CSV files, Dataworkz also supports TSV, Parquet and JSON files stored in object stores.

An example of how to organize data coming in on a regular cadence into the object store is described later in this document.

At a high level, the following steps need to be followed -

  • Organize cloud object store data

  • Create workspace in Dataworkz

  • Discover data in cloud storage

  • Verify text qualifier, delimiter and headers detected

  • Identify which fields contain PII data and mask them appropriately if needed

  • Setup downstream processing of CSV data by defining transformation jobs

  • Create a pipeline to move transformed CSV to destination

Organize cloud object store data - CSV stored in AWS S3

Every enterprise organizes its data lake differently, and Dataworkz has the flexibility to work with the many ways in which data can arrive in a data lake. One recommended way to organize cloud storage when multiple external systems write CSV data, in different formats, into the lake is -

  • Store all files of a specific type, typically generated by an external system, in their own folder. For example, if there are five different partners providing different types of data, each with its own CSV schema, then a folder for each one is the recommended way to organize the data being ingested –

    1. Partner1, Partner2, Partner3, Partner4, Partner5

  • Additionally, Dataworkz recommends that a separate CSV file be written for each “year/month/day”.

  • The folder structure in the cloud object store should be organized as follows -

    <folder-name>/<year>/<month>/<day>/<file-name>

Example:

  1. If a supplier creates a CSV file with product inventory that is updated once a month (here, on the 15th), the S3 structure should be -

product-inventory/2021/01/15/inventory.csv

  2. If a supplier creates a CSV file with product inventory that is updated daily -

product-inventory/2021/01/01/inventory.csv

product-inventory/2021/01/02/inventory.csv

….

….

product-inventory/2021/01/29/inventory.csv

product-inventory/2021/01/30/inventory.csv

  3. If a supplier creates a CSV file with product inventory that is updated hourly -

product-inventory/2021/01/01/inventory_01.csv

product-inventory/2021/01/01/inventory_02.csv

….

….

product-inventory/2021/01/01/inventory_15.csv

product-inventory/2021/01/01/inventory_20.csv

The names used above are representative; there are no special requirements on file names as long as each new file has a unique name.
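If a process on your side writes these files with a script rather than a partner feed, the recommended layout can be produced with a few lines of Python. This is only a minimal sketch - the bucket name, folder name and file names below are illustrative placeholders, not Dataworkz requirements.

# Minimal sketch: upload a daily CSV under <folder-name>/<year>/<month>/<day>/<file-name>.
# "my-data-lake" and the local path are placeholders.
from datetime import date

import boto3

s3 = boto3.client("s3")

def upload_daily_csv(local_path, folder="product-inventory", bucket="my-data-lake"):
    today = date.today()
    key = f"{folder}/{today:%Y}/{today:%m}/{today:%d}/inventory_{today:%Y%m%d}.csv"
    s3.upload_file(local_path, bucket, key)   # e.g. product-inventory/2021/01/02/inventory_20210102.csv
    return key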

Dataworkz Workspace

A workspace is a logical structure for grouping related datasets in Dataworkz. Dataworkz provides both coarse- and fine-grained access control for datasets, and access to a workspace can be controlled using the RBAC provided natively in Dataworkz.

Discover data in S3

To discover data in cloud storage, a cloud object store instance first needs to be created in Dataworkz; this requires administrator permissions. Once the storage has been created and tested, the following steps need to be followed -

  • Choose the “Discover” menu option in the Dataworkz web application to navigate to the workflow for discovering new data.

  • Select the appropriate workspace and storage type on the page.

  • Upon selecting the storage, Dataworkz retrieves the details of the folders and objects stored in the location specified in the storage.

  • You can navigate to a specific folder by clicking on the list of available folders.

  • When you have selected the appropriate folder, click on the “Discover” button to retrieve the list of files stored under that specific folder.

  • The list displays the name of the folder and the type of file stored in it. For CSV files, you will also see the delimiter and text qualifier detected by Dataworkz, as well as whether a header was detected in the file.

Verify column delimiter, text qualifier and headers detected

Dataworkz gives you the ability to verify that the correct column delimiter and text qualifier have been detected.

You can also check whether the headers were included in the first line of the CSV file. If no headers are detected, Dataworkz automatically creates header names (Self_Discovered_1, Self_Discovered_2 … Self_Discovered_N) and detects the data type for each column.

The supported data types are -

  • Text

  • String

  • Int

  • Long

  • Double

  • Float

  • Date

  • Time

  • Date time

  • Timestamp

  • Boolean
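If you want to sanity-check a file locally before ingestion - the same delimiter, text-qualifier and header checks that Dataworkz performs automatically - a rough sketch using Python's standard csv module is shown below. The file name is a placeholder, and this snippet is not part of the Dataworkz product.

# Rough local check of delimiter, text qualifier and header detection.
import csv

with open("inventory.csv", newline="") as f:
    sample = f.read(4096)

dialect = csv.Sniffer().sniff(sample)          # detects delimiter and quote character
has_header = csv.Sniffer().has_header(sample)  # heuristic header detection

print("delimiter:", repr(dialect.delimiter))
print("text qualifier:", repr(dialect.quotechar))
print("header row present:", has_header)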

If a file contains headers, then each new version of the file would also contain the header information. Dataworkz monitors the files for any deviation from the defined schema and generates alerts when changes are detected.

Additionally, Dataworkz can use a header file provided by the end user to specify the header names and data types for each column in the input file. For each column, an uploaded header file assigns a name, a data type, a data format, a description and a flag indicating whether the column contains PII. The header file always contains 5 columns and (n+1) rows, where n is the total number of columns in the input file.

Example:

Consider a CSV input file that contains 4 columns as shown below -

tt0000001, 5.6, 1611, 04/01/1980

tt0000002, 6.0, 1980, 01/01/1975

tt0000003, 6.5, 1289, 11/15/1992

The headers for the CSV data above need to be described as follows -

Header Name    Header Type    Header Format    Header Desc          PII
Employee_Id    String                          Unique Identifier    N
Rating         Double                          Employee Rating      N
Salary         Long                            Monthly Salary       N
DoB            Date           MM/dd/yyyy       Date of Birth        Y

The corresponding header file for the sample input above will contain 5 columns and 5 rows as shown below -

Header Name,Header Type,Header Format,Header Desc,PII Data

Employee_Id, String, , Unique Identifier, N

Rating, Double, , Employee Rating, N

Salary, Long, , Monthly Salary, N

DoB, Date, MM/dd/yyyy, Date of Birth, Y

  • Row 1 in the header file describes what each column in the header file means.

  • Row 2 in the header file describes the first column in the input file - name (Employee_Id), data type (String), data format (empty), column description (Unique Identifier) and PII value (N).

  • Row 3 in the header file describes the second column in the input file, and so on.
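The header file itself is an ordinary CSV, so it can be written by hand or generated with a short script. The sketch below simply reproduces the example header file above with Python's standard csv module; the output file name is a placeholder.

# Minimal sketch: write the 5-column header file for the example input above.
import csv

columns = [
    # (Header Name, Header Type, Header Format, Header Desc, PII Data)
    ("Employee_Id", "String", "", "Unique Identifier", "N"),
    ("Rating", "Double", "", "Employee Rating", "N"),
    ("Salary", "Long", "", "Monthly Salary", "N"),
    ("DoB", "Date", "MM/dd/yyyy", "Date of Birth", "Y"),
]

with open("inventory_headers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Header Name", "Header Type", "Header Format", "Header Desc", "PII Data"])
    writer.writerows(columns)   # one row per column of the input file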

Setup downstream processing of CSV data by defining transformations

Dataworkz provides a self-service web based experience to transform raw CSV data as needed. The dataset explorer provides a list of transformations based on the data type of each column.

There are more than 50 transformation functions available in the Dataworkz UI, including -

  • case statements with multiple columns

  • masking/hashing

  • deduplication

  • aggregation

Dataworkz provides self-documenting lineage and each step can be traced back to the source in the lineage view.
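The transformations themselves are configured in the UI rather than in code. Purely as a mental model of what a few of these operations do, a rough pandas equivalent is sketched below; the file and column names assume the example schema from the header file section and are not required by Dataworkz.

# Not the Dataworkz UI: a rough pandas illustration of masking/hashing,
# deduplication and aggregation on the example schema above.
import hashlib

import pandas as pd

df = pd.read_csv("employees.csv")

# 1-way hash a PII column (assumes a "DoB" column exists)
df["DoB"] = df["DoB"].astype(str).map(lambda v: hashlib.sha256(v.encode()).hexdigest())

# Deduplicate rows, then aggregate a numeric column per employee
df = df.drop_duplicates()
summary = df.groupby("Employee_Id", as_index=False)["Salary"].mean()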

Create a pipeline to move transformed CSV to destination

Once the data is in the desired format, it can be moved to the required destination. Writing the data to the destination can be configured as a one-time or a recurring job by selecting the appropriate UI options.
