I would appreciate your opinion on a join that I need to do but that is taking far too long. I must be missing some important point here.
File 1:
118 rows (eventually should be 8,000,000+)
Dataset hash partitioned, sorted by key (not unique)
File 2:
21,000,000 rows
Dataset hash partitioned, sorted by key (unique)
I have a job that performs an inner join between these 2 files, and the join part alone runs for 3+ hours.
DS file --\
           > Join -> Transformer -> Target HDFS
DS file --/
Obviously I am missing something here, because a join on files that are already sorted by key shouldn't take this long!?
Join on large file takes long time
-
- Premium Member
- Posts: 87
- Joined: Sat Jul 14, 2007 2:24 pm
Are they sorted properly, i.e. to support the needs of the join? Does your job know that your input data sets are sorted? Dumping the score would tell you what's going on. I'll wager once you do you'll see the previous answer is "no" so it is resorting everything. Add Sort stages before the Join set to "Don't sort, already sorted" and see if that helps sort it out. Pun intended.
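To illustrate why the sort matters: when both inputs really are sorted on the join key, an inner join only needs a single pass over each side, whereas re-sorting 21,000,000 rows first adds a full sort on top of that. The following is a minimal sketch in Python of a sort-merge inner join, not DataStage's actual implementation. The function name `merge_inner_join` and the sample rows are made up for illustration, and it assumes (as in the original post) that keys on the left may repeat while keys on the right are unique.

```python
def merge_inner_join(left, right):
    """Inner-join two iterables of (key, value) pairs that are ALREADY sorted by key.

    Assumption for this sketch: keys in `left` may repeat, keys in `right` are unique.
    Each input is read exactly once, so no re-sort is needed.
    """
    right_iter = iter(right)
    current = next(right_iter, None)
    for key, left_value in left:
        # Advance the right side until its key catches up with the left key.
        while current is not None and current[0] < key:
            current = next(right_iter, None)
        if current is None:
            break  # right side exhausted, no further matches are possible
        if current[0] == key:
            # Do not advance here: the next left row may carry the same key.
            yield key, left_value, current[1]


# Hypothetical sample data, both sides sorted by key.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
right_rows = [(1, "X"), (2, "Y"), (3, "Z"), (4, "W")]

result = list(merge_inner_join(left_rows, right_rows))
print(result)
```

If the engine cannot tell that its inputs are sorted, it has to sort them before a pass like this can run, which is the kind of extra work a dumped score would reveal.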
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 87
- Joined: Sat Jul 14, 2007 2:24 pm
The files were sorted correctly, on the right keys.
But what did the magic was adding the dummy Sort stages before the Join, A-M-A-Z-I-N-G :D , thank you so much for the tip! I would not have figured out this workaround on my own. (BTW, it still makes me wonder: isn't there a "proper" way to tell DS that the files are already sorted, instead of the dummy Sort stages?)
The join now takes 1.5 minutes instead of 3+ hours!