Skip to main content

AggregatorStageTransformer

Overview

Transforms DataStage AGGREGATOR stages to SQL GROUP BY statements. Performs grouping and aggregation operations (COUNT, SUM, AVG, MAX, MIN).

Stage Type: AGGREGATOR (group operator) Output: SQL SELECT with GROUP BY

Capabilities

  • Single and multiple grouping keys
  • COUNT(*) aggregation
  • SUM, AVG, MAX, MIN aggregations
  • Automatic output schema generation
  • Supports composite GROUP BY clauses

DataStage Stage Example

Input (.dsx format)

#### STAGE: AggregateData
## Operator
group
## Operator options
-hash
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
-countField 'SALES_COUNT'
-sumField 'SALE_AMOUNT' 'TOTAL_SALES'
-avgField 'SALE_AMOUNT' 'AVG_SALES'
## General options
[ident('AggregateData'); jobmon_ident('AggregateData')]
## Inputs
0< [] 'SalesData:input.v'
## Outputs
0> [] 'AggregateData:Output.v'

Input Schema

[
{"name": "CUSTOMER_ID", "type": "string"},
{"name": "PRODUCT_ID", "type": "string"},
{"name": "SALE_AMOUNT", "type": "number"},
{"name": "SALE_DATE", "type": "date"}
]

Generated SQL Output

SELECT
CUSTOMER_ID,
PRODUCT_ID,
COUNT(*) AS SALES_COUNT,
SUM(SALE_AMOUNT) AS TOTAL_SALES,
AVG(SALE_AMOUNT) AS AVG_SALES
FROM salesdata_V0S8
GROUP BY CUSTOMER_ID, PRODUCT_ID

Output Schema

[
{"name": "CUSTOMER_ID", "type": "string"},
{"name": "PRODUCT_ID", "type": "string"},
{"name": "SALES_COUNT", "type": "number"},
{"name": "TOTAL_SALES", "type": "number"},
{"name": "AVG_SALES", "type": "number"}
]

How It Works

1. Identify Grouping Keys

Parses -key options to identify GROUP BY columns:

-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'

Generates:

GROUP BY CUSTOMER_ID, PRODUCT_ID

2. Extract Aggregation Functions

Parses aggregation options:

OptionSQLExample
-countField 'ROW_COUNT'COUNT(*) AS ROW_COUNTCount rows per group
-sumField 'AMOUNT' 'TOTAL'SUM(AMOUNT) AS TOTALSum of AMOUNT column
-avgField 'AMOUNT' 'AVG_AMT'AVG(AMOUNT) AS AVG_AMTAverage of AMOUNT
-maxField 'VALUE' 'MAX_VAL'MAX(VALUE) AS MAX_VALMaximum VALUE
-minField 'VALUE' 'MIN_VAL'MIN(VALUE) AS MIN_VALMinimum VALUE

3. Build SELECT Clause

Combines grouping keys and aggregations:

SELECT
<grouping_key_1>,
<grouping_key_2>,
COUNT(*) AS <count_field>,
SUM(<input_col>) AS <sum_field>,
...
FROM input_table
GROUP BY <grouping_key_1>, <grouping_key_2>

4. Infer Output Schema

  • Grouping keys inherit types from input schema
  • COUNT fields typed as INTEGER
  • SUM/AVG fields typed as DECIMAL
  • MAX/MIN fields inherit input column type

Aggregation Functions

COUNT

-countField 'ROW_COUNT'

Generates:

COUNT(*) AS ROW_COUNT

Counts rows per group.

SUM

-sumField 'SALE_AMOUNT' 'TOTAL_SALES'

Generates:

SUM(SALE_AMOUNT) AS TOTAL_SALES

Sums values of SALE_AMOUNT column per group.

AVG

-avgField 'SALE_AMOUNT' 'AVG_SALES'

Generates:

AVG(SALE_AMOUNT) AS AVG_SALES

Calculates average of SALE_AMOUNT per group.

MAX

-maxField 'SALE_DATE' 'LATEST_DATE'

Generates:

MAX(SALE_DATE) AS LATEST_DATE

Finds maximum value per group.

MIN

-minField 'SALE_DATE' 'EARLIEST_DATE'

Generates:

MIN(SALE_DATE) AS EARLIEST_DATE

Finds minimum value per group.

Single Key Example

Input

-key 'REGION_CODE'
-countField 'REGION_COUNT'

Generated SQL

SELECT
REGION_CODE,
COUNT(*) AS REGION_COUNT
FROM input_table
GROUP BY REGION_CODE

Composite Key Example

Input

-key 'REGION_CODE'
-key 'CATEGORY_CODE'
-key 'PERIOD_CODE'
-countField 'ROW_COUNT'

Generated SQL

SELECT
REGION_CODE,
CATEGORY_CODE,
PERIOD_CODE,
COUNT(*) AS ROW_COUNT
FROM input_table
GROUP BY REGION_CODE, CATEGORY_CODE, PERIOD_CODE

Multiple Aggregations

You can combine multiple aggregation functions:

Input

-key 'CUSTOMER_ID'
-countField 'ORDER_COUNT'
-sumField 'ORDER_VALUE' 'TOTAL_VALUE'
-avgField 'ORDER_VALUE' 'AVG_VALUE'
-maxField 'ORDER_DATE' 'LAST_ORDER_DATE'
-minField 'ORDER_DATE' 'FIRST_ORDER_DATE'

Generated SQL

SELECT
CUSTOMER_ID,
COUNT(*) AS ORDER_COUNT,
SUM(ORDER_VALUE) AS TOTAL_VALUE,
AVG(ORDER_VALUE) AS AVG_VALUE,
MAX(ORDER_DATE) AS LAST_ORDER_DATE,
MIN(ORDER_DATE) AS FIRST_ORDER_DATE
FROM input_table
GROUP BY CUSTOMER_ID

Limitations

  • ⚠️ No HAVING clause - Post-aggregation filters not supported (use downstream FILTER stage)
  • ⚠️ No complex expressions - Aggregations only work on simple column references, not expressions
  • ⚠️ No DISTINCT aggregations - COUNT(DISTINCT col) not supported
  • ⚠️ Hash partitioning ignored - -hash option is DataStage-specific and not converted to SQL
  • ⚠️ No ROLLUP/CUBE - Advanced grouping sets not supported
  • TransformerStageTransformer - For post-aggregation column derivations
  • ModifyStageTransformer - For post-aggregation column filtering
  • RemoveDuplicatesTransformer - For simple deduplication without aggregation