Parallel aggregator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
perspicax
Premium Member
Premium Member
Posts: 26
Joined: Thu Dec 07, 2017 3:11 pm
Location: USA

Parallel aggregator

Post by perspicax »

Is pipe-lining possible when aggregator stage is used. For instance, I have a job using aggregator, that sums up colD by grouping 3 input columns, ColA, ColB, ColC, so before aggregator start spitting the rows out, it waits for all the input rows to be read.

Is there a way to make aggregator stage producing output even before all the input data is read?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sort the data to support the aggregation and then assert that it is sorted in the stage. It will bust you if you lie, be forewarned. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What Craig said.

The reason this "pipelines" the data is that the Aggregator stage does not have to build a table in memory to accumulate all the results. It can processing a single grouping key at a time, processing all the rows for that group (since the data are sorted by grouping key(s)), and immediately transfer that group result to the output line and free the memory that was being used to hold the results.

Thus Sort mode make minimal use of memory compared to Hash mode. Of course, some of that memory will be consumed sorting the data upstream of the Aggregator stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
perspicax
Premium Member
Premium Member
Posts: 26
Joined: Thu Dec 07, 2017 3:11 pm
Location: USA

Post by perspicax »

Thanks. It worked. Earlier, the sort order was different than order of group key column. I selected partition type 'Same'. I am not sure what it means. 'Auto' and 'Entire' didn't work. What is the recommended partition type?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm not sure there's a "recommended" so much as a default one but others can address that. The partitioning types need to be something you become very familiar with unless you only ever run your jobs on a single node. The moment you open that up to more than one node, how the data is partitioned can make or break things. Simplest case is it runs longer that it should. Worst case, your output data is wrong or incomplete. Or the job goes boom. :wink:

Perhaps this will help.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Same is recommended iff the upstream stage:
  • is executing in parallel mode
    is executing on the same nodes
    is partitioned using the correct algorithm
In the case of an Aggregator stage I would typically use a key-based partitioning algorithm (e.g. Hash or Modulus) on the first grouping key, and sort on all of the grouping keys in their correct order.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
perspicax
Premium Member
Premium Member
Posts: 26
Joined: Thu Dec 07, 2017 3:11 pm
Location: USA

Post by perspicax »

Yes the job is executing on 6 nodes with all the stages running on the same nodes. In this eg, the upstream stage is a transformer stage. The input stage is a Oracle stage where partitioned read is enabled (with default rowid range).

The group is performed on Year(kind of redundant because we only process current year and target inserts into current year partition only), month, id1, id2...Id10. That forms the grain of the target table. So Year is the first grouping key and sorting is in the same as grouping key order.

I did get a significant performance improvement with 'same' partition, aggregating 26 M rows in less than 4 min. But when I specified 'Hash' or 'Modulus', I do not see the pipe-lining and the job takes long time to complete
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How is the upstream Transformer stage partitioned?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
perspicax
Premium Member
Premium Member
Posts: 26
Joined: Thu Dec 07, 2017 3:11 pm
Location: USA

Post by perspicax »

The transformer is set to auto. So if 'hash' or 'modulus' is selected in Agg stage then the same partition type should be set in the transformer? I will change the settings and report back.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That would depend on what partitioning is used upstream of the Transformer stage. Your answer suggests an unfamiliarity with what partitioning does and achieves. Is there any explicit partitioning upstream in your job design?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

chulett wrote:Perhaps this will help.
Did you click on the included link? Partitioning is core, key knowledge required for success using the tool. IMHO.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply