DSXchange: DataStage and IBM Websphere Data Integration Forum
This topic has been marked "Resolved."
asorrell
Site Admin

Group memberships:
Premium Members, DSXchange Team, Inner Circle, Server to Parallel Transition Group

Joined: 04 Apr 2003
Posts: 1698
Location: Colleyville, Texas
Points: 23116

Posted: Sun Jan 05, 2020 6:49 am

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
We have a situation where we are sorting a moderately large dataset on about 100 nodes. However, we want to send the sorted results to a join with "Entire" partitioning on the left leg of a right outer join.

-YES- I know this sounds odd, but trust me, there are some extreme variants in the data, and this is getting us around them in a very performant manner (so far).

The question is: if we sort on 100 nodes and then send the data straight to the join with "Entire" selected, will DataStage keep all the records in sorted order as it consolidates the data from all 100 nodes and copies the full set back out to all 100 nodes?

As an alternative, I know we can consolidate from the sort down to a single-threaded copy stage, which will keep the results in sorted order. Then sending that out to the join as "Entire" will keep the order in place.

_________________
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
asorrell
Site Admin


Posted: Sun Jan 05, 2020 10:22 am

And the answer seems to be "no". We ran a test and it looks like it dropped about 5 million records, so I'm assuming that DataStage doesn't keep the records in sorted order as it goes from many partitions (Hash) to many (Entire).

Inserting a single-threaded copy stage works, but is slow (run time goes from 20 minutes to 1.2 hours).

/sigh....

The original problem is that the partitioned data is too "clumpy": some of the partitions get a massive amount of data while others get very little. We found that the "right" leg of the join could add an additional partitioning key to get a finer grain and spread the data out. However, the left leg doesn't have an equivalent key, so we need to use "Entire" on that leg of the join to get the elements to match up.
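To make the "clumpy" partitioning concrete, here is a minimal Python sketch (hypothetical keys and node count, not the actual job or DataStage internals) showing how a dominant key value overloads one hash partition, and how a finer-grain secondary key spreads it out:

```python
import hashlib
from collections import Counter

def node_for(key: str, nodes: int = 4) -> int:
    # Deterministic key-hash partitioning: equal keys always land on the same node.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % nodes

# Hypothetical clumpy data: one key value dominates.
rows = ["A"] * 1000 + ["B"] * 10 + ["C"] * 10

skewed = Counter(node_for(k) for k in rows)
# All 1000 "A" rows share one partition, so one node does nearly all the work.

# Adding a finer-grain secondary key splits the dominant value across nodes.
spread = Counter(node_for(f"{k}|{i % 4}") for i, k in enumerate(rows))
```

The secondary key trick only helps on the leg that actually has such a key, which is exactly why the left leg here falls back to "Entire".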

_________________
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
UCDI




Posted: Tue Jan 07, 2020 11:39 am

I can't speak to solving the issue as you describe it, but if you did something robust to the keys yourself, like taking a SHA of the keys you have on both sides, it should redistribute clumpy keys evenly and perform solidly.

I don't think there is anything usable built in; you may need to download and compile a program to generate workable hashes. CRC won't cut it.
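As an illustration of the suggestion (a sketch with made-up keys, not a built-in DataStage feature): when distinct keys clump under a naive partitioner, hashing each key with SHA-256 before taking the modulus spreads them evenly, and computing it identically on both legs keeps matching keys co-located:

```python
import hashlib
from collections import Counter

NODES = 4
keys = range(0, 4000, 4)  # hypothetical keys with a stride of 4

# Naive modulo partitioning: every key lands on node 0.
naive = Counter(k % NODES for k in keys)

def sha_node(key, nodes=NODES):
    # Partition on a SHA-256 of the key; computed identically on both join
    # legs, so matching rows still meet on the same node.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % nodes

even = Counter(sha_node(k) for k in keys)  # roughly 250 keys per node
```

Note this helps only when many *distinct* keys clump together; a single dominant key value still hashes to a single node no matter how strong the hash is.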
PaulVL




Posted: Wed Jan 08, 2020 12:10 pm

Couldn't you just do a join with it being set to sequential rather than parallel?
asorrell
Site Admin


Posted: Wed Jan 15, 2020 9:34 am

PaulVL wrote:
Couldn't you just do a join with it being set to sequential rather than parallel?


Nope - it's processing in excess of a billion rows, sometimes up to double-digit billions!

Found a solution, by the way. I used a partitioned Sort stage, fed it to a sequential Copy stage using "Sort-Merge" collection (without "Sort" checked), then sent it to the left side of the Join stage with "Entire".

The Sort stage processed quickly because it ran on 100+ nodes in parallel. The Copy stage did a decent job of consolidating the data in order and feeding it out to the Join partitions as quickly as it came in.
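The "Sort-Merge" collection step behaves like a k-way merge of already-sorted streams. A minimal Python analogue (toy data, not the DataStage internals):

```python
import heapq

# Each list stands in for one partition's output from the parallel Sort
# stage, already sorted within itself.
partitions = [
    [1, 5, 9],
    [2, 4, 8],
    [3, 6, 7],
]

# Sort-Merge collection: repeatedly emit the smallest head-of-stream record,
# yielding one globally ordered stream without re-sorting anything.
merged = list(heapq.merge(*partitions))
# merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the merge only compares the current head of each stream, it can forward records downstream as fast as they arrive, which matches the behavior described above.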

Having to single-thread the Copy slowed the job down a bit, but overall performance has improved dramatically.

All the changes have reduced runtimes by 95%. Job was running in hours and is now completing in minutes.

Some of the redesign I recommended also reduced memory and scratch usage dramatically. The job used to have a problem with Sorts running out of scratch space; now almost all of them run in memory. That allows the job to use the "Entire" strategy, trading increased memory usage for excellent data distribution and a 10x+ throughput gain in the Join stage.

_________________
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett


Posted: Wed Jan 15, 2020 2:59 pm

Nice!

_________________
-craig

"May the bridges I burn light my way forward"
