Working with CSV files
Introduction
This document describes how to use Dataworkz to process CSV files stored in AWS S3. The same conventions apply to CSV files stored in any cloud object store, including AWS S3, Azure Blob Storage, and Google Cloud Storage.
In addition to CSV files, Dataworkz also supports TSV, Parquet and JSON files stored in object stores.
An example of how to organize data coming in on a regular cadence into the object store is described later in this document.
At a high level, the following steps need to be followed -
Organize cloud object store data
Create a workspace in Dataworkz
Discover data in cloud storage
Verify the detected text qualifier, delimiter and headers
Identify which fields contain PII data and mask them appropriately if needed
Set up downstream processing of CSV data by defining transformation jobs
Create a pipeline to move transformed CSV to destination
Organize cloud object store data - CSV stored in AWS S3
Every enterprise organizes data in its data lake differently, and Dataworkz is flexible enough to work with many of these layouts. One way to organize cloud storage when multiple external systems write CSV data, in different formats, into the lake is described below.
Store all files of a specific type, typically generated by a single external system, in their own folder. For example, if there are five different partners providing different types of data, each with its own CSV schema, then a folder for each partner is the recommended way to organize the data being ingested –
Partner1, Partner2, Partner3, Partner4, Partner5
Additionally, Dataworkz recommends that a separate CSV file be written for each “year/month/day”.
The folder structure in the cloud object store should be organized as follows -
<folder-name>/<year>/<month>/<day>/<file-name>
Example:
If a supplier creates a CSV file with product inventory that is updated once a month, the S3 structure should be -
product-inventory/2021/01/15/inventory.csv
If a supplier creates a CSV file with product inventory that is updated daily -
product-inventory/2021/01/01/inventory.csv
product-inventory/2021/01/02/inventory.csv
….
….
product-inventory/2021/01/29/inventory.csv
product-inventory/2021/01/30/inventory.csv
If a supplier creates a CSV file with product inventory that is updated hourly -
product-inventory/2021/01/01/inventory_01.csv
product-inventory/2021/01/01/inventory_02.csv
….
….
product-inventory/2021/01/01/inventory_15.csv
product-inventory/2021/01/01/inventory_20.csv
The name of each new file must be unique. The names used above are representative; there are no special requirements on file names as long as each new file name is unique.
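As an illustrative sketch (not part of Dataworkz itself), the following Python snippet builds an object key following the recommended <folder-name>/<year>/<month>/<day>/<file-name> convention and uploads a local file with boto3; the bucket and file names are hypothetical.

```python
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

def object_key(folder: str, file_name: str, when: datetime) -> str:
    # Build the recommended <folder-name>/<year>/<month>/<day>/<file-name> key.
    return f"{folder}/{when:%Y}/{when:%m}/{when:%d}/{file_name}"

# Hypothetical bucket and file names, for illustration only.
bucket = "my-data-lake"
key = object_key("product-inventory", "inventory.csv", datetime.now(timezone.utc))
print(key)  # e.g. product-inventory/2021/01/02/inventory.csv

boto3.client("s3").upload_file("inventory.csv", bucket, key)
```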
Dataworkz Workspace
A workspace is a logical structure for grouping related datasets in Dataworkz. Dataworkz provides both coarse- and fine-grained access control for datasets. Access to a workspace can be controlled using RBAC provided natively in Dataworkz.
Discover data in S3
To discover data in cloud storage, a cloud object store instance needs to be created in Dataworkz; this requires administrator permissions. Once the storage has been created and tested, the following steps need to be followed -
Choose the “Discover” menu option in the Dataworkz web application to navigate to the workflow for discovering new data.
Select the appropriate workspace and storage type on the page.
Upon selecting the storage, Dataworkz retrieves details of the folders and objects stored in the configured location.
You can navigate to a specific folder by clicking on it in the list of available folders.
When you have selected the appropriate folder, click on the “Discover” button to retrieve the list of files stored under that specific folder.
The list displays the name of the folder and type of file stored in the folder. For CSV files, you will also see the delimiter and text qualifiers that are detected by Dataworkz.
You will also see whether Dataworkz has detected a header in the file.
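For readers who want to verify what the Discover page will see, the same listing can be reproduced outside Dataworkz with boto3; the bucket name below is hypothetical.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket name

# Top-level "folders" are the common prefixes, as presented on the Discover page.
resp = s3.list_objects_v2(Bucket=bucket, Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # e.g. product-inventory/

# Files under a chosen folder.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="product-inventory/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```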
Verify column delimiter, text qualifier and headers detected
Dataworkz gives you the ability to verify that the correct column delimiter and text qualifier have been detected.
You can also check whether headers were included in the first line of the CSV file. If no headers are detected, Dataworkz automatically creates header names (Self_Discovered_1, Self_Discovered_2 … Self_Discovered_N) and detects the data type for each column.
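Dataworkz performs this detection internally; as a rough standalone analogue, Python's csv.Sniffer can detect the delimiter and text qualifier and apply the same Self_Discovered_N fallback, as sketched below with hypothetical sample data.

```python
import csv
import io

# Hypothetical two-line sample with no header row.
sample = "tt0000001,5.6,1611,04/01/1980\r\ntt0000002,6.0,1980,01/01/1975\r\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # detects delimiter and text qualifier
has_header = sniffer.has_header(sample)  # heuristic header check

reader = csv.reader(io.StringIO(sample), dialect)
first_row = next(reader)
if not has_header:
    # Fall back to generated names, as Dataworkz does.
    headers = [f"Self_Discovered_{i + 1}" for i in range(len(first_row))]
else:
    headers = first_row

print(dialect.delimiter, dialect.quotechar, headers)
```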
The supported data types are -
Text
String
Int
Long
Double
Float
Date
Time
Date time
Timestamp
Boolean
If a file contains headers, each new version of the file is expected to contain the same header information. Dataworkz monitors the files for any deviation from the defined schema and generates alerts when changes are detected.
Additionally, Dataworkz can use a header file provided by the end user to specify the header names and data types for the columns in the input file. The header file assigns each column a name, a data type, and a data format, provides a description, and indicates whether the column contains PII information. The header file always contains 5 columns and (n+1) rows, where n is the total number of columns in the input file.
Example:
Consider a CSV input file that contains 4 columns as shown below -
tt0000001, 5.6, 1611, 04/01/1980
tt0000002, 6.0, 1980, 01/01/1975
tt0000003, 6.5, 1289, 11/15/1992
The headers for the CSV data above need to be described as follows -

Header Name    Header Type    Header Format    Header Desc          PII Data
Employee_Id    String                          Unique Identifier    N
Rating         Double                          Employee Rating      N
Salary         Long                            Monthly Salary       N
DoB            Date           MM/dd/yyyy       Date of Birth        Y
The corresponding header file for the sample input above will contain 5 columns and 5 rows as shown below -
Header Name,Header Type,Header Format,Header Desc,PII Data
Employee_Id,String,,Unique Identifier,N
Rating,Double,,Employee Rating,N
Salary,Long,,Monthly Salary,N
DoB,Date,MM/dd/yyyy,Date of Birth,Y
Row 1 in the header file describes what each column in the header file means.
Row 2 in the header file describes the first column in the input file - name (Employee_Id), data type (String), data format (empty), column description (Unique Identifier) and PII value (N).
Row 3 in the header file describes the second column in the input file and so on.
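As a sketch of how such a header file can be consumed (this is not Dataworkz's internal implementation), the snippet below parses a local copy of the header file above and uses it to turn raw input rows into named records; the file names are hypothetical.

```python
import csv

# Hypothetical local copies of the header file and input file shown above.
with open("headers.csv", newline="") as f:
    rows = list(csv.reader(f, skipinitialspace=True))

spec = rows[1:]  # row 1 describes the header-file columns themselves
names = [col[0] for col in spec]  # Employee_Id, Rating, Salary, DoB
pii_columns = {col[0] for col in spec if col[4].strip().upper() == "Y"}

with open("input.csv", newline="") as f:
    for values in csv.reader(f, skipinitialspace=True):
        record = dict(zip(names, values))
        print(record, "PII fields:", sorted(pii_columns & record.keys()))
```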
Set up downstream processing of CSV data by defining transformations
Dataworkz provides a self-service, web-based experience to transform raw CSV data as needed. The dataset explorer provides a list of transformations based on the data type of each column.
There are more than 50 transformation functions available in the Dataworkz UI, including -
case statements with multiple columns
masking/hashing
deduplication
aggregation
Dataworkz provides self-documenting lineage and each step can be traced back to the source in the lineage view.
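To make the masking/hashing transformation concrete, here is a plain-Python equivalent (not Dataworkz's implementation) that one-way hashes the PII column flagged in the header file example; file and column names are hypothetical.

```python
import csv
import hashlib

def mask(value: str) -> str:
    # One-way SHA-256 hash, truncated; a common masking approach.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

names = ["Employee_Id", "Rating", "Salary", "DoB"]  # from the header file above

with open("input.csv", newline="") as src, open("masked.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(names)
    for row in csv.reader(src, skipinitialspace=True):
        record = dict(zip(names, row))
        record["DoB"] = mask(record["DoB"])  # DoB was flagged as PII (Y)
        writer.writerow(record[n] for n in names)
```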
Create a pipeline to move transformed CSV to destination
Once the data is in the desired format, it can be moved to the required destination. The write to the destination can be configured as a one-time or a recurring job by selecting the appropriate UI options.