Big integrate - joins

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rajalaa
Participant
Posts: 4
Joined: Tue Mar 25, 2014 1:11 am
Location: Chennai

Big integrate - joins

Post by rajalaa »

Hello,
I have datastage job which reads data from BDFS stage. The underlying HQL is currently reading data from single table (Hive Hql). All the other lookup / joins are done in Datastage job.
What would be the advantage if I perform the joins with multiple tables in Hive hql and have the data read from BDFS file stage instead of performing the joins in Datastage after reading data from single table?

Thanks.
Thanks
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm going to give some general advice here on the topic, meaning nothing specific to BDFS or Hive. The general rule of thumb is you do joins within the job flow when you need to join either disparate data sources or sources like database tables that come from separate systems. As a corollary to that, if you've got multiple tables in the same relational source, do the joins there - with a caveat. I've been at clients where the source system is the weakest link in the chain and we weren't "allowed" to do joins there for fear of impacting performance. So in those cases, we simply streamed data out as gently as we could, loaded them into our processing or target database and did the joins there if possible. Otherwise, they were done in job.

Hope some of that helps. IMHO, well worth the experiment to give it a shot and see if / how much it helps in your specific case.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply