How to improve aggregator performance?

Alexander Scherbina
Participant
Posts: 2
Joined: Thu Dec 10, 2009 3:19 am

Post by Alexander Scherbina »

In my job I have to sum 120 columns, grouping by 10 other columns. The total number of rows to aggregate is about 5-6 million. All rows are sorted and partitioned in a Sort stage before aggregation, but performance on the Aggregator stage is still very low: only 2000-3000 rows/sec :(

I tried configuration files with 5 and 8 nodes, but this didn't significantly affect performance. Strangely, we see only 20-30% CPU usage while this job is running.

Without the Aggregator stage, the job performs very well: reading from datasets, sorting, joining, filtering, writing to file, etc. are all very fast.

Are there any project parameters or other settings that would improve aggregation performance?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard. What execution mode are you using for the Aggregator stage? What aggregation mode (sort or hash) are you using?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Alexander Scherbina
Participant
Posts: 2
Joined: Thu Dec 10, 2009 3:19 am

Post by Alexander Scherbina »

ray.wurlod wrote:Welcome aboard. What execution mode are you using for the Aggregator stage? What aggregation mode (sort or hash) are you using? ...
Execution mode is parallel. Aggregation mode is sort.

I forgot to mention that there are no warnings in the job log.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Ignore rows/sec as a performance metric; it is meaningless on the output of an Aggregator stage, because the clock keeps running during all of the time the stage spends waiting for rows to come in.
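
To make that point concrete, here is a rough arithmetic sketch. All of the numbers and both function names are hypothetical and chosen only for illustration; they are not measurements from the job discussed in this thread. It compares a rate computed over the whole elapsed time (waiting included) with a rate computed over the time the stage actually spends working.

```python
def apparent_rate(rows, wait_seconds, work_seconds):
    """Rows/sec when the clock also counts time spent waiting for input."""
    return rows / (wait_seconds + work_seconds)


def working_rate(rows, work_seconds):
    """Rows/sec counted only over the time spent aggregating."""
    return rows / work_seconds


# Hypothetical figures, for illustration only.
rows = 6_000_000
wait_seconds = 1_900   # assumed time spent waiting on upstream stages
work_seconds = 100     # assumed time spent actually aggregating

apparent = apparent_rate(rows, wait_seconds, work_seconds)
working = working_rate(rows, work_seconds)
print(f"apparent: {apparent:.0f} rows/sec, while working: {working:.0f} rows/sec")
```

With these made-up inputs the two figures differ by a factor of twenty, which is why overall job elapsed time is usually a more useful thing to compare than the rows/sec shown on the stage's link.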
sridinesh2009
Participant
Posts: 14
Joined: Wed Nov 11, 2009 4:52 am
Location: New York

Post by sridinesh2009 »

in aggregator stage use this option... ur performance may increase

METHOD=Hash
Dinesh.D
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

sridinesh2009 wrote:in aggregator stage use this option... ur performance may increase

METHOD=Hash
Please don't use SMS/text style words as this is not a mobile phone.

The OP said he has 5-6 million records to aggregate, and the Hash method should only be used when the number of records is small. If you set the method to hash, it starts throwing warnings when the number of records reaches the 16K mark. There are also other implications when the hash table grows beyond a certain size.
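
As a generic illustration of the difference between the two methods (this is a plain Python sketch, not DataStage code, and the function names are made up for the example): a sort-based aggregation over pre-sorted input only needs to hold one group's running total at a time, while a hash-based aggregation keeps an entry in memory for every distinct group until the input is exhausted.

```python
from itertools import groupby
from operator import itemgetter


def sort_aggregate(sorted_rows, key_idx, val_idx):
    """Sum values per key, assuming rows are already sorted by that key.

    Only the current group is held in memory at any moment.
    """
    results = []
    for key, group in groupby(sorted_rows, key=itemgetter(key_idx)):
        results.append((key, sum(row[val_idx] for row in group)))
    return results


def hash_aggregate(rows, key_idx, val_idx):
    """Sum values per key for rows in any order.

    The dictionary grows with the number of distinct keys.
    """
    totals = {}
    for row in rows:
        key = row[key_idx]
        totals[key] = totals.get(key, 0) + row[val_idx]
    return totals


# Tiny illustrative input, already sorted on the grouping key.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
sorted_result = sort_aggregate(rows, 0, 1)
hashed_result = hash_aggregate(rows, 0, 1)
print(sorted_result)
print(hashed_result)
```

Both functions return the same totals here; the trade-off is that the hash version does not need sorted input but its memory use scales with the number of groups.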
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink: