Union
Guide to combining rows from two Datasets vertically using the Union operation in Data Prep.
The Union operation in Data Prep combines rows from two Datasets into a single unified Dataset using a visual, drag-and-drop canvas. Unlike a Join — which combines Datasets horizontally by matching columns — a Union stacks Datasets vertically, appending the rows of one Dataset to the other. The resulting Dataset can be saved to a target collection and executed as a one-time or recurring job.
When to Use
Use the Union operation when you need to:
Combine rows from two Datasets that share the same or similar structure into a single result set
Consolidate data from multiple sources — such as different time periods, regions, or systems — into one Dataset
Handle schema differences between Datasets by providing default values for missing fields
Accessing the Union Operation
Navigate to Data Studio → Data Prep → Union from the top navigation menu.
The Configure Union page opens with the Dataset Explorer in the left panel and the visual canvas on the right. A Union node is already present on the canvas ready to be connected.

Step-by-Step: Configuring a Union
Step 1: Add Datasets to the Canvas
In the left panel, expand the Dataset Explorer to locate your Datasets. Drag the first Dataset onto the canvas. Repeat for the second Dataset. Both Dataset nodes appear on the canvas alongside the Union node.

Step 2: Connect the Datasets to the Union Node
Connect each Dataset node to the Union node by dragging from the Dataset's output port to the Union node's input ports. Both connection lines appear on the canvas when the Datasets are successfully linked.
Step 3: Configure Date and Column Selection (Optional)
Click the edge connecting one of the Dataset nodes to the Union node. The Directory Date and Header panel opens on the right.

Select Data By
Choose Date Range to filter by a specific start and end time, or Time Interval to filter by a rolling interval
Start Time / End Time
Set the date and time boundaries for the data to include from this Dataset
Select Columns
Choose which columns from this Dataset to include in the Union result. Defaults to all columns selected
Click Add to apply the configuration. Repeat this step for the edge connecting the second Dataset to the Union node.
Step 4: Review and Resolve Column Mapping
Click the Union node on the canvas. A column mapping table appears at the bottom of the screen showing a superset of all columns from both Datasets.

Each column is listed with its availability across both Datasets:
Header name
The column name from the combined schema
Dataset 1
Indicates whether this column exists in the first Dataset — Yes or No
Dataset 2
Indicates whether this column exists in the second Dataset — Yes or No
Custom value
A custom default value for missing fields, if specified
Action
Shows Add missing data for columns that exist in one Dataset but not the other
Where a column exists in one Dataset but not the other, rows from the Dataset lacking that column will produce null values for that field by default. To specify a custom default value instead, click Add missing data.

The Add Header default Value dialog opens. Enter a Header value to use as the default for rows where this field is absent, then click OK.
💡 Note: Adding a default value for missing fields is optional. If left as null, rows from the Dataset lacking the column will have an empty value for that field in the output.
Step 5: Save and Configure Output
Once the Union is configured and column mapping is reviewed, click Save in the top right corner of the canvas. The Execute dialog opens with two steps — Target and Schedule.
Step 1 — Target

Workspace
Select the target workspace where the output Dataset will be stored
Store in Collection
Select the collection within the workspace
Choose Target Dataset
Select an existing Dataset as the output destination, or create a new one
Directory Format
Select the output file format — CSV or Parquet
Click Next to proceed to the Schedule step.
Step 2 — Schedule

Job Name
Enter a descriptive name for the job. A default name is generated automatically
Job Frequency
Select One time to run the job immediately once, or Recurring to configure a repeating schedule
Degree of Parallelism
Set the number of parallel tasks for job execution — default is 30
No. of cores
Set the number of CPU cores allocated to the job — default is 10
Click Submit to execute the job.
Monitoring the Job
After submitting, the job is queued for execution. Monitor its progress and status on the Job Monitoring screen in Data Studio.
Last updated

