AggregatorStageTransformer
Overview
Transforms DataStage AGGREGATOR stages to SQL GROUP BY statements. Performs grouping and aggregation operations (COUNT, SUM, AVG, MAX, MIN).
Stage Type: AGGREGATOR (group operator)
Output: SQL SELECT with GROUP BY
Capabilities
- Single and multiple grouping keys
- COUNT(*) aggregation
- SUM, AVG, MAX, MIN aggregations
- Automatic output schema generation
- Supports composite GROUP BY clauses
DataStage Stage Example
Input (.dsx format)
#### STAGE: AggregateData
## Operator
group
## Operator options
-hash
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
-countField 'SALES_COUNT'
-sumField 'SALE_AMOUNT' 'TOTAL_SALES'
-avgField 'SALE_AMOUNT' 'AVG_SALES'
## General options
[ident('AggregateData'); jobmon_ident('AggregateData')]
## Inputs
0< [] 'SalesData:input.v'
## Outputs
0> [] 'AggregateData:Output.v'
Input Schema
[
{"name": "CUSTOMER_ID", "type": "string"},
{"name": "PRODUCT_ID", "type": "string"},
{"name": "SALE_AMOUNT", "type": "number"},
{"name": "SALE_DATE", "type": "date"}
]
Generated SQL Output
SELECT
CUSTOMER_ID,
PRODUCT_ID,
COUNT(*) AS SALES_COUNT,
SUM(SALE_AMOUNT) AS TOTAL_SALES,
AVG(SALE_AMOUNT) AS AVG_SALES
FROM salesdata_V0S8
GROUP BY CUSTOMER_ID, PRODUCT_ID
Output Schema
[
{"name": "CUSTOMER_ID", "type": "string"},
{"name": "PRODUCT_ID", "type": "string"},
{"name": "SALES_COUNT", "type": "number"},
{"name": "TOTAL_SALES", "type": "number"},
{"name": "AVG_SALES", "type": "number"}
]
How It Works
1. Identify Grouping Keys
Parses -key options to identify GROUP BY columns:
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
Generates:
GROUP BY CUSTOMER_ID, PRODUCT_ID
2. Extract Aggregation Functions
Parses aggregation options:
| Option | SQL | Example |
|---|---|---|
-countField 'ROW_COUNT' | COUNT(*) AS ROW_COUNT | Count rows per group |
-sumField 'AMOUNT' 'TOTAL' | SUM(AMOUNT) AS TOTAL | Sum of AMOUNT column |
-avgField 'AMOUNT' 'AVG_AMT' | AVG(AMOUNT) AS AVG_AMT | Average of AMOUNT |
-maxField 'VALUE' 'MAX_VAL' | MAX(VALUE) AS MAX_VAL | Maximum VALUE |
-minField 'VALUE' 'MIN_VAL' | MIN(VALUE) AS MIN_VAL | Minimum VALUE |
3. Build SELECT Clause
Combines grouping keys and aggregations:
SELECT
<grouping_key_1>,
<grouping_key_2>,
COUNT(*) AS <count_field>,
SUM(<input_col>) AS <sum_field>,
...
FROM input_table
GROUP BY <grouping_key_1>, <grouping_key_2>
4. Infer Output Schema
- Grouping keys inherit types from input schema
- COUNT fields typed as INTEGER
- SUM/AVG fields typed as DECIMAL
- MAX/MIN fields inherit input column type
Aggregation Functions
COUNT
-countField 'ROW_COUNT'
Generates:
COUNT(*) AS ROW_COUNT
Counts rows per group.
SUM
-sumField 'SALE_AMOUNT' 'TOTAL_SALES'
Generates:
SUM(SALE_AMOUNT) AS TOTAL_SALES
Sums values of SALE_AMOUNT column per group.
AVG
-avgField 'SALE_AMOUNT' 'AVG_SALES'
Generates:
AVG(SALE_AMOUNT) AS AVG_SALES
Calculates average of SALE_AMOUNT per group.
MAX
-maxField 'SALE_DATE' 'LATEST_DATE'
Generates:
MAX(SALE_DATE) AS LATEST_DATE
Finds maximum value per group.
MIN
-minField 'SALE_DATE' 'EARLIEST_DATE'
Generates:
MIN(SALE_DATE) AS EARLIEST_DATE
Finds minimum value per group.
Single Key Example
Input
-key 'REGION_CODE'
-countField 'REGION_COUNT'
Generated SQL
SELECT
REGION_CODE,
COUNT(*) AS REGION_COUNT
FROM input_table
GROUP BY REGION_CODE
Composite Key Example
Input
-key 'REGION_CODE'
-key 'CATEGORY_CODE'
-key 'PERIOD_CODE'
-countField 'ROW_COUNT'
Generated SQL
SELECT
REGION_CODE,
CATEGORY_CODE,
PERIOD_CODE,
COUNT(*) AS ROW_COUNT
FROM input_table
GROUP BY REGION_CODE, CATEGORY_CODE, PERIOD_CODE
Multiple Aggregations
You can combine multiple aggregation functions:
Input
-key 'CUSTOMER_ID'
-countField 'ORDER_COUNT'
-sumField 'ORDER_VALUE' 'TOTAL_VALUE'
-avgField 'ORDER_VALUE' 'AVG_VALUE'
-maxField 'ORDER_DATE' 'LAST_ORDER_DATE'
-minField 'ORDER_DATE' 'FIRST_ORDER_DATE'
Generated SQL
SELECT
CUSTOMER_ID,
COUNT(*) AS ORDER_COUNT,
SUM(ORDER_VALUE) AS TOTAL_VALUE,
AVG(ORDER_VALUE) AS AVG_VALUE,
MAX(ORDER_DATE) AS LAST_ORDER_DATE,
MIN(ORDER_DATE) AS FIRST_ORDER_DATE
FROM input_table
GROUP BY CUSTOMER_ID
Limitations
- ⚠️ No HAVING clause - Post-aggregation filters not supported (use downstream FILTER stage)
- ⚠️ No complex expressions - Aggregations only work on simple column references, not expressions
- ⚠️ No DISTINCT aggregations - COUNT(DISTINCT col) not supported
- ⚠️ Hash partitioning ignored -
-hashoption is DataStage-specific and not converted to SQL - ⚠️ No ROLLUP/CUBE - Advanced grouping sets not supported
Related Transformers
- TransformerStageTransformer - For post-aggregation column derivations
- ModifyStageTransformer - For post-aggregation column filtering
- RemoveDuplicatesTransformer - For simple deduplication without aggregation