Join on large file takes long time



asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm


Post by asaf_arbely »

I would appreciate your opinion on a join that is taking far too long. I must be missing something important here.

File 1:
118 rows (eventually should be 8,000,000+)
Dataset hash partitioned, sorted by key (not unique)

File 2:
21,000,000 rows
Dataset hash partitioned, sorted by key (unique)

I have a job that does an inner join between these two files, and the join alone runs for 3+ hours.
DS file --\
           > Join -> Transformer -> Target HDFS
DS file --/

Obviously I am missing something, because a plain join on files already sorted by key shouldn't take this long!?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Are they sorted properly, i.e. to support the needs of the join? Does your job know that your input data sets are sorted? Dumping the score would tell you what's going on. I'll wager that once you do, you'll see the answer to the second question is "no", so it is re-sorting everything. Add Sort stages before the Join with the key mode set to "Don't Sort (Previously Sorted)" and see if that helps sort it out. Pun intended.
-craig

"You can never have too many knives" -- Logan Nine Fingers
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

The files were sorted correctly, on the right keys.

But what did the magic was the dummy Sort stages before the join. A-M-A-Z-I-N-G :D Thank you so much for the tip! I would not have figured out this workaround on my own. (BTW, it still makes me wonder: isn't there a "proper" way to tell DS that the files are already sorted, instead of the dummy Sort stages?)

The join now takes 1.5 minutes instead of 3+ hours!
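
For anyone wondering why the sort knowledge matters so much: once both inputs are known to be ordered by the join key, an inner join needs only a single forward pass over each side, with no re-sort. Here is a minimal Python sketch of that idea. It is not DataStage's actual implementation; `merge_join` and the sample rows are made up for illustration, and the right side is assumed to have unique keys, like File 2 in this thread.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables that are ALREADY sorted by key.

    Assumptions (illustrative only): rows are tuples, left keys may
    repeat, right keys are unique. Each input is read once, front to
    back, with no sorting step.
    """
    right_iter = iter(right)
    right_row = next(right_iter, None)
    for k, left_group in groupby(left, key=key):
        # Advance the right side until its key catches up with the left key.
        while right_row is not None and key(right_row) < k:
            right_row = next(right_iter, None)
        if right_row is None:
            return  # right side exhausted, so no further matches are possible
        if key(right_row) == k:
            # Emit one joined row per left row sharing this key.
            for left_row in left_group:
                yield left_row + right_row[1:]


# Hypothetical sample data: left has a repeated key, right has unique keys.
left = [(1, "a"), (1, "b"), (3, "c"), (7, "d")]
right = [(1, "x"), (2, "y"), (3, "z"), (5, "w")]
result = list(merge_join(left, right))
print(result)
```

If the engine cannot assume the order, it has to sort both inputs first, and on 21,000,000 rows that sort is where the hours go. The pass-through Sort stages simply assert the order so the join can stream like the loop above.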