Partitioning algorithm for order of the data

sb2212
Participant
Posts: 36
Joined: Mon Apr 28, 2008 1:22 am

Post by sb2212 »

Hi,

I have a DataStage job that runs on a 3-node configuration. The flow is:
Seq File -> Column Import -> Filter -> Column Import -> Copy -> Custom made operator. The Column Import, Copy and Filter stages run in the default parallel execution mode, and the custom made operator runs in sequential mode.

Could anyone help me with the best partitioning algorithm for each stage, so that the output data comes out in the same order as the source? The source is a fixed-width file, the target is a CSV, and it's a plain 1-1 mapping.

I tried keeping Auto, but the order of the target data was different from the source.
sb2212
Participant
Posts: 36
Joined: Mon Apr 28, 2008 1:22 am

Post by sb2212 »

I tried two ways. In the first, the 1st Column Import and the custom made operator used Round Robin, and the Filter, the 2nd Column Import and the Copy stage used Same.

In the second, the 1st Column Import and the custom made operator used Round Robin, the Filter used Hash, and the 2nd Column Import and the Copy used Same.

Neither of these yields the right order, and the reason is the Filter stage, which filters the header and trailer information out of the data.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Partitioning distributes data across your processing nodes. As long as you don't introduce a hash partition in the flow, it might be possible to specify a collection method at your final sequential file stage to "undo" the partitioning, but I personally wouldn't even bother to try that.

You have a couple of simple options:
1) Run with a 1-node configuration file or change the execution mode of all stages to sequential.
2) Have the initial sequential file stage generate a row number and then sort the data by that row number prior to the final sequential file stage.
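
Option 2 can be illustrated outside DataStage with a small simulation. This is only a sketch in plain Python, not DataStage code: the function names, the 3-node round robin split and the "HDR"/"TRL" markers are all assumptions made for the example.

```python
def round_robin_partition(rows, nodes):
    """Deal rows out to `nodes` partitions in turn (idealized round robin)."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def run_with_row_numbers(lines, nodes=3):
    """Tag each record with its source row number, process in parallel,
    then sort by that number just before the final write."""
    # Assumption: the source stage can emit a row number column.
    numbered = list(enumerate(lines))
    parts = round_robin_partition(numbered, nodes)

    # Each partition filters header/trailer records independently.
    filtered = [
        [(n, rec) for n, rec in part if rec not in ("HDR", "TRL")]
        for part in parts
    ]

    # Collect the partitions in no particular order, then restore the
    # source order by sorting on the row number.
    collected = [item for part in filtered for item in part]
    collected.sort(key=lambda item: item[0])
    return [rec for _, rec in collected]


source = ["HDR", "a", "b", "c", "d", "e", "TRL"]
restored = run_with_row_numbers(source)
```

Because the row number travels with each record, the result does not depend on which partitioner or collector was used in between.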

Mike
sb2212
Participant
Posts: 36
Joined: Mon Apr 28, 2008 1:22 am

Post by sb2212 »

Thank you, Mike, for the solution. Executing the stages in sequential mode solved the issue.
But I have a question about the "undo" of partitioning in the final stage: we don't have any "undo" collection method in the final sequential stage.

I am marking this post as resolved.
sb2212
Participant
Posts: 36
Joined: Mon Apr 28, 2008 1:22 am

Post by sb2212 »

I realised that the order was going wrong after the Filter stage, which filters out the header/trailer.
Hence the 1st Column Import and the Filter stage were executed in sequential mode, the 2nd Column Import and the custom made stage were executed in parallel using the Round Robin algorithm, and the Copy stage in between was run in parallel using the Same algorithm.
This produced output data in the same order as the input data.
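
A rough way to see why this arrangement can work is to simulate it. The sketch below is plain Python, not DataStage: it assumes an idealized round robin partitioner and collector that take exactly one record per partition in turn, and the record values and "HDR"/"TRL" markers are made up for the example. A real parallel engine may not behave this deterministically.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for partitions that have run out of records


def partition_round_robin(rows, nodes):
    """Deal rows out to `nodes` partitions in turn."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def collect_round_robin(parts):
    """Read one record from each partition in turn, skipping empty ones."""
    out = []
    for group in zip_longest(*parts, fillvalue=_MISSING):
        out.extend(rec for rec in group if rec is not _MISSING)
    return out


source = ["HDR", "r1", "r2", "r3", "r4", "r5", "TRL"]

# Filter sequentially first, so partitioning starts on the first data record.
data = [rec for rec in source if rec not in ("HDR", "TRL")]
parts = partition_round_robin(data, 3)  # "Same" downstream keeps these as-is
result = collect_round_robin(parts)
```

In this idealized model the collector simply reverses what the partitioner did, so `result` comes back in source order.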
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

I don't think the output of the round robin algorithm is reliable as far as the sort order is concerned, because nothing in it preserves the sort order. A round robin partitioner and collector may give you the correct result, but not reliably. I would suggest a sort merge collector instead, or running in sequential mode, so that the current sort order is not disturbed.

PS: "Undo" doesn't refer to an algorithm; it means undoing the current partitioning, i.e. removing it.
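
The difference between a round robin collector and a sort merge collector can be sketched in plain Python. This is a simulation, not DataStage code: the row number key, the helper names and the "HDR"/"TRL" markers are assumptions for the example, and the round robin model is idealized.

```python
import heapq
from itertools import zip_longest

_MISSING = object()  # sentinel for partitions that have run out of records


def partition_round_robin(rows, nodes):
    """Deal rows out to `nodes` partitions in turn."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def collect_round_robin(parts):
    """Read one record from each partition in turn, skipping empty ones."""
    out = []
    for group in zip_longest(*parts, fillvalue=_MISSING):
        out.extend(rec for rec in group if rec is not _MISSING)
    return out


def collect_sort_merge(parts, key):
    """Merge partitions that are each already sorted on `key`."""
    return list(heapq.merge(*parts, key=key))


# Records carry a (hypothetical) source row number as the merge key.
source = list(enumerate(["HDR", "r1", "r2", "r3", "r4", "r5", "r6", "TRL"]))

# Partition first, then filter header/trailer inside each partition.
parts = [
    [rec for rec in part if rec[1] not in ("HDR", "TRL")]
    for part in partition_round_robin(source, 3)
]

rr_order = [value for _, value in collect_round_robin(parts)]
sm_order = [value for _, value in collect_sort_merge(parts, key=lambda rec: rec[0])]
```

Once the header has been dropped from the first partition, the partitions are out of step, so `rr_order` no longer matches the source, while `sm_order` does because it merges on the key rather than on arrival position.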
Priyadarshi Kunal
