Catalog

The Dataset Catalog is the schema-level reference for a dataset. It surfaces the structural, statistical, and governance information needed to understand what a dataset contains, who can access it, and how it has evolved over time, without opening the data itself.

Accessing the Catalog

  1. Navigate to Data Studio > Dataset.

  2. Locate the dataset within its Workspace and Collection and open it.

  3. Click Catalog in the Quick Links section at the top right of the dataset view.

The Catalog page opens, showing the dataset name in the header (e.g., Catalog - fin_bench_ingest) alongside key metadata and a set of tabbed sections.

Catalog Header

The top bar of every Catalog page displays a summary of the Dataset's key properties at a glance:

| Field | Description |
|---|---|
| Status | The certification status of the Dataset (e.g., Uncertified). |
| Created date | The date the Dataset was first created (e.g., 12/08/2025). |
| Date range | The range of dates covered by the data in the Dataset (e.g., 12/08/2025 – 12/08/2025). |
| Format | The storage format of the Dataset (e.g., parquet). |
| Size | The total size of the Dataset on disk (e.g., 1.83 MB). |
| Version | The current version number of the Dataset. |
| Dictionary | The number of columns defined in the Dataset's data dictionary (e.g., 23 Columns). |
| View dataset | A shortcut to open the Dataset in the Table/List view. |

Catalog Tabs

The Catalog is organised into six tabs:

| Tab | Description |
|---|---|
| Overview | A summary of the Dataset including its description, user comments, usage and view statistics, Collection properties, tags, and similar datasets. |
| Version | The version history of the Dataset, reflecting schema changes over time. |
| Access | The Workspace, Collection, and permission settings that control who can view or modify the Dataset. |
| Headers | The complete column-level schema: name, data type, format, description, PII flag, and key definitions for every column. |
| Quality | A quality assessment of the Dataset, including completeness and consistency indicators. |
| Retention Policy | The data retention rules configured for the Dataset. |

Overview Tab

The Overview tab provides a human-readable summary of the Dataset, including:

  • Description — A free-text description of the Dataset's purpose (editable via the pencil icon). Example: Directory created for pre-processing.

  • User Comments — A rating and comment thread from users who have interacted with the Dataset. Displayed as an average score out of 5.

  • Usage Stats — A time-series chart showing how frequently the Dataset has been used (executed or queried) over the selected period.

  • View Stats — A time-series chart showing how many times the Dataset has been viewed over the selected period.

  • Collection Properties — The underlying storage details:

| Property | Example |
|---|---|
| Collection Name | s3_ashish_dataworkz_account |
| Path | s3a://dataworkz-genai-dev-lake |
| Storage Type | S3 |

  • Recent updates — The users who created and last modified the Dataset.

  • Tags — Labels assigned to the Dataset for discovery and categorisation (editable via the pencil icon).

  • Similar Datasets — Other datasets in the platform with similar schema or content (shown if available).

Headers Tab

The Headers tab is the primary schema reference for the Dataset. It lists every column with its full metadata definition.

Toggle between Header (column definitions) and Sample data (a preview of actual values) using the radio buttons at the top.

Column Definitions

Each row in the headers table describes one column:

| Field | Description |
|---|---|
| Header | The column name as it appears in the Dataset. |
| Data Type | The data type of the column (e.g., Integer, String, Long). |
| Data Format | The format of the data within the column, if applicable (e.g., date formats). |
| Header Description | A human-readable description of the column's content. |
| PII Data | Whether the column contains Personally Identifiable Information (Y / N). |
| Unique Identifier | Whether the column serves as a unique identifier for records (Y / N). |
| Foreign Key | Whether the column is a foreign key referencing another Dataset (Y / N). |
| Reference Table | The name of the referenced table if the column is a foreign key. |
| Semantics Tag | A semantic label automatically or manually assigned to the column (e.g., None). |
| Action | Edit (✎) or delete (🗑) the column definition. |

Example columns from fin_bench_ingest:

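| Header | Data Type | PII Data | Unique Identifier | Foreign Key |
|---|---|---|---|---|
| page_no | Integer | N | N | N |
| source_file_name | String | N | N | N |
| source_text | String | N | N | N |
| source_creation_date | Long | N | N | N |
| total_pages | Integer | N | N | N |
| version | String | N | N | N |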

Click Edit in the top-right of the headers table to modify column definitions in bulk.
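
When working with column definitions outside the UI, for example in a review script, it can help to model each row of the headers table as a small record. The sketch below is illustrative only: the class name `ColumnDefinition`, its field names, and the helper `pii_columns` are assumptions made for this example and are not part of any platform API. It encodes the six example columns listed above and checks that the Foreign Key and Reference Table fields agree with each other.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnDefinition:
    """One row of the Headers table. Field names are illustrative, not a platform API."""
    header: str
    data_type: str
    data_format: Optional[str] = None
    description: str = ""
    pii: bool = False                 # PII Data (Y / N)
    unique_identifier: bool = False   # Unique Identifier (Y / N)
    foreign_key: bool = False         # Foreign Key (Y / N)
    reference_table: Optional[str] = None
    semantics_tag: Optional[str] = None

    def problems(self) -> List[str]:
        """Return mismatches between the Foreign Key and Reference Table fields."""
        issues = []
        if self.foreign_key and not self.reference_table:
            issues.append(f"{self.header}: foreign key without a reference table")
        if self.reference_table and not self.foreign_key:
            issues.append(f"{self.header}: reference table set but not marked as a foreign key")
        return issues


# The six example columns from fin_bench_ingest; every flag is N, so the defaults apply.
FIN_BENCH_INGEST = [
    ColumnDefinition("page_no", "Integer"),
    ColumnDefinition("source_file_name", "String"),
    ColumnDefinition("source_text", "String"),
    ColumnDefinition("source_creation_date", "Long"),
    ColumnDefinition("total_pages", "Integer"),
    ColumnDefinition("version", "String"),
]


def pii_columns(columns: List[ColumnDefinition]) -> List[str]:
    """Names of the columns flagged as containing PII."""
    return [c.header for c in columns if c.pii]


print(pii_columns(FIN_BENCH_INGEST))                        # no column is flagged as PII
print([p for c in FIN_BENCH_INGEST for p in c.problems()])  # no key mismatches
```

Both calls print an empty list for the example columns, since none of them is flagged as PII or as a foreign key.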

Version Tab

The Version tab shows the history of schema changes across all versions of the Dataset. Each version reflects a snapshot of the schema at the time it was created, allowing you to trace how columns and data types have evolved.

💡 Note: The Catalog reflects the schema at the current version. To compare schema changes across versions, use the Version tab within the Catalog view.
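
Conceptually, comparing two versions means diffing two schema snapshots. The following is a minimal sketch of that idea, assuming each snapshot can be represented as a mapping from column name to data type. The function `diff_schemas` and both sample snapshots are hypothetical; the column names are borrowed from fin_bench_ingest, but the version 1 contents are invented for illustration and do not reflect the dataset's real history.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # column name -> data type


def diff_schemas(old: Schema, new: Schema) -> Dict[str, list]:
    """Report columns added, removed, or retyped between two schema snapshots."""
    added: List[str] = sorted(set(new) - set(old))
    removed: List[str] = sorted(set(old) - set(new))
    retyped: List[Tuple[str, str, str]] = sorted(
        (name, old[name], new[name])
        for name in set(old) & set(new)
        if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots: v1 is invented for this example.
v1 = {"page_no": "Integer", "source_text": "String", "source_creation_date": "String"}
v2 = {
    "page_no": "Integer",
    "source_text": "String",
    "source_creation_date": "Long",
    "total_pages": "Integer",
}

changes = diff_schemas(v1, v2)
print(changes)
```

Here the diff reports `total_pages` as added and `source_creation_date` as retyped from String to Long, which is the kind of change the Version tab lets you trace.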

Access Tab

The Access tab shows the Workspace and Collection the Dataset belongs to, along with the permission settings that control which users or roles can view or modify it. Access is governed by the role-based access control (RBAC) configured at the Workspace level.

Quality Tab

The Quality tab provides an automated assessment of the Dataset's data quality, including indicators for completeness (null/missing values) and consistency (data type conformance and pattern adherence) across columns.
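
To make the two indicators concrete, here is a rough sketch of how completeness and type conformance could be computed for a single column. It is not the platform's actual scoring method, which this page does not specify. The function names, the treatment of empty strings as missing, and the sample values are all assumptions.

```python
from typing import Any, List


def _is_missing(value: Any) -> bool:
    """Treat None and empty strings as missing (an assumption for this sketch)."""
    return value is None or value == ""


def completeness(values: List[Any]) -> float:
    """Fraction of values that are present. An empty column counts as complete."""
    if not values:
        return 1.0
    present = sum(1 for v in values if not _is_missing(v))
    return present / len(values)


def type_conformance(values: List[Any], expected_type: type) -> float:
    """Fraction of non-missing values whose Python type matches expected_type."""
    non_missing = [v for v in values if not _is_missing(v)]
    if not non_missing:
        return 1.0
    # bool is a subclass of int in Python, so exclude it when checking for integers.
    ok = sum(
        1
        for v in non_missing
        if isinstance(v, expected_type)
        and not (expected_type is int and isinstance(v, bool))
    )
    return ok / len(non_missing)


# Hypothetical values for an Integer column such as page_no.
page_no_values = [1, 2, None, "4"]

print(completeness(page_no_values))           # 3 of 4 values are present
print(type_conformance(page_no_values, int))  # 2 of 3 present values are integers
```

In this sample, one missing value lowers completeness to 0.75, and the string "4" lowers integer conformance to two thirds of the present values.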

Retention Policy Tab

The Retention Policy tab displays any data retention rules configured for the Dataset — for example, how long data is retained before automatic archival or deletion.
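
As an illustration of how such a rule might be evaluated, the sketch below models a retention rule as a retention period in days plus an action. The `RetentionRule` class, the `due_action` function, and the dates used are hypothetical; this page does not describe how the platform represents or enforces retention rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    """A hypothetical retention rule: keep data for retain_days, then apply the action."""
    retain_days: int
    action: str  # "archive" or "delete"


def due_action(created: date, rule: RetentionRule, today: date) -> Optional[str]:
    """Return the rule's action once the retention period has elapsed, else None."""
    expires_on = created + timedelta(days=rule.retain_days)
    return rule.action if today >= expires_on else None


rule = RetentionRule(retain_days=90, action="archive")
created = date(2025, 1, 1)  # illustrative creation date

print(due_action(created, rule, today=date(2025, 3, 31)))  # still within the 90 days
print(due_action(created, rule, today=date(2025, 4, 1)))   # retention period elapsed
```

With a 90-day rule and a creation date of 1 January 2025, the first call returns None and the second returns "archive", because the retention period ends on 1 April 2025.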
