Union

Guide to combining rows from two Datasets vertically using the Union operation in Data Prep.

The Union operation in Data Prep combines rows from two Datasets into a single unified Dataset using a visual, drag-and-drop canvas. Unlike a Join — which combines Datasets horizontally by matching columns — a Union stacks Datasets vertically, appending the rows of one Dataset to the other. The resulting Dataset can be saved to a target collection and executed as a one-time or recurring job.

When to Use

Use the Union operation when you need to:

  • Combine rows from two Datasets that share the same or similar structure into a single result set

  • Consolidate data from multiple sources — such as different time periods, regions, or systems — into one Dataset

  • Handle schema differences between Datasets by providing default values for missing fields

Accessing the Union Operation

Navigate to Data Studio → Data Prep → Union from the top navigation menu.

The Configure Union page opens with the Dataset Explorer in the left panel and the visual canvas on the right. A Union node is already present on the canvas ready to be connected.

Step-by-Step: Configuring a Union

Step 1: Add Datasets to the Canvas

In the left panel, expand the Dataset Explorer to locate your Datasets. Drag the first Dataset onto the canvas. Repeat for the second Dataset. Both Dataset nodes appear on the canvas alongside the Union node.

Step 2: Connect the Datasets to the Union Node

Connect each Dataset node to the Union node by dragging from the Dataset's output port to the Union node's input ports. Both connection lines appear on the canvas when the Datasets are successfully linked.

Step 3: Configure Date and Column Selection (Optional)

Click the edge connecting one of the Dataset nodes to the Union node. The Directory Date and Header panel opens on the right.

Field
Description

Select Data By

Choose Date Range to filter by a specific start and end time, or Time Interval to filter by a rolling interval

Start Time / End Time

Set the date and time boundaries for the data to include from this Dataset

Select Columns

Choose which columns from this Dataset to include in the Union result. Defaults to all columns selected

Click Add to apply the configuration. Repeat this step for the edge connecting the second Dataset to the Union node.

Step 4: Review and Resolve Column Mapping

Click the Union node on the canvas. A column mapping table appears at the bottom of the screen showing a superset of all columns from both Datasets.

Each column is listed with its availability across both Datasets:

Column
Description

Header name

The column name from the combined schema

Dataset 1

Indicates whether this column exists in the first Dataset — Yes or No

Dataset 2

Indicates whether this column exists in the second Dataset — Yes or No

Custom value

A custom default value for missing fields, if specified

Action

Shows Add missing data for columns that exist in one Dataset but not the other

Where a column exists in one Dataset but not the other, rows from the Dataset lacking that column will produce null values for that field by default. To specify a custom default value instead, click Add missing data.

The Add Header default Value dialog opens. Enter a Header value to use as the default for rows where this field is absent, then click OK.

💡 Note: Adding a default value for missing fields is optional. If left as null, rows from the Dataset lacking the column will have an empty value for that field in the output.

Step 5: Save and Configure Output

Once the Union is configured and column mapping is reviewed, click Save in the top right corner of the canvas. The Execute dialog opens with two steps — Target and Schedule.

Step 1 — Target

Field
Description

Workspace

Select the target workspace where the output Dataset will be stored

Store in Collection

Select the collection within the workspace

Choose Target Dataset

Select an existing Dataset as the output destination, or create a new one

Directory Format

Select the output file format — CSV or Parquet

Click Next to proceed to the Schedule step.

Step 2 — Schedule

Field
Description

Job Name

Enter a descriptive name for the job. A default name is generated automatically

Job Frequency

Select One time to run the job immediately once, or Recurring to configure a repeating schedule

Degree of Parallelism

Set the number of parallel tasks for job execution — default is 30

No. of cores

Set the number of CPU cores allocated to the job — default is 10

Click Submit to execute the job.

Monitoring the Job

After submitting, the job is queued for execution. Monitor its progress and status on the Job Monitoring screen in Data Studio.

Last updated