Partitioning in Lookup Stage

r.bhatia
Participant
Posts: 11
Joined: Mon Jun 30, 2008 12:45 am
Location: Manchester

Partitioning in Lookup Stage

Post by r.bhatia »

In one of my jobs I am doing a lookup against a large dataset.

The partitioning used on both inputs (reference and main) is hash on the same keys.

It has been working fine so far, but I wanted to know whether this is a risk.

I know that Entire partitioning is recommended for the reference link, but because our dataset is very large, it does not seem to be a good option.

Please advise.

Rakesh
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You are 100% correct to use the Hash partitioning algorithm on the same keys on both the stream and reference inputs. It is totally risk free if done correctly.

On a single machine, Entire on the reference input isn't much of an overhead because a single copy of the reference data set is put into shared memory.

However, in a multiple-machine configuration, Entire involves sending all rows of the reference data set to all nodes defined in the configuration file via TCP/IP, which might be a substantial overhead.

Therefore, particularly in multi-machine configurations, you should use identical key-based partitioning (hash or modulus) on the same keys on both the stream and reference inputs.
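
To make the trade-off concrete, here is a rough Python sketch (not DataStage code) of the two strategies. The CRC32-based `partition_of` is only a stand-in for whatever hash function the engine really uses, and the row layout, the key values, and the four-partition setup are assumptions made for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # assumed node count, for illustration only


def partition_of(key: str, num_partitions: int) -> int:
    # Stand-in hash: the engine's real hash function is not reproduced here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(rows, key_field, num_partitions):
    # Each row goes to exactly one partition, chosen from its key value.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key_field], num_partitions)].append(row)
    return parts


def entire_partition(rows, num_partitions):
    # Every partition receives a full copy of the data.
    return [list(rows) for _ in range(num_partitions)]


def count_misses(stream_parts, ref_parts, key_field):
    # A lookup only sees the reference rows held in its own partition.
    misses = 0
    for s_part, r_part in zip(stream_parts, ref_parts):
        ref_keys = {r[key_field] for r in r_part}
        misses += sum(1 for s in s_part if s[key_field] not in ref_keys)
    return misses


# Hypothetical data: 1000 reference rows, every third key on the stream.
reference = [{"id": f"1-{6108676 + i}", "name": f"cust{i}"} for i in range(1000)]
stream = [{"id": f"1-{6108676 + i}"} for i in range(0, 1000, 3)]

stream_parts = hash_partition(stream, "id", NUM_PARTITIONS)
ref_hash = hash_partition(reference, "id", NUM_PARTITIONS)
ref_entire = entire_partition(reference, NUM_PARTITIONS)

hash_misses = count_misses(stream_parts, ref_hash, "id")
entire_misses = count_misses(stream_parts, ref_entire, "id")
hash_rows_held = sum(len(p) for p in ref_hash)
entire_rows_held = sum(len(p) for p in ref_entire)

print("misses with hash/hash:", hash_misses)
print("misses with hash/entire:", entire_misses)
print("reference rows held, hash:", hash_rows_held)
print("reference rows held, entire:", entire_rows_held)
```

Both strategies resolve every lookup, but with Entire the total number of reference rows held across partitions grows with the number of nodes, whereas with identical hash partitioning each reference row lives in exactly one partition.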
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
satyam_ps
Participant
Posts: 13
Joined: Sun Apr 27, 2008 5:11 am
Location: Bangalore

strange problem encountered in the same context

Post by satyam_ps »

I have a Lookup stage in which I hash partition the main and reference link data on a certain key on the inputs of the Lookup stage. To my surprise, the lookup failed because values that were the same on the main and reference links ended up in different partitions.

After modifying the setup a bit, introducing a Peek stage and partitioning the data before the Peek, I found that the value '1-6108676' coming from the main link was put into partition 0, while the same value from the reference link was put into partition 1. Because of this, the lookup was not finding a match.

Another thing to note is that when I do an internal sort and partition on the inputs of the Lookup stage for both the main and reference links, the lookup succeeds.

Can anyone tell me why this is happening?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I can only surmise that the hash keys were not identically specified and/or were not the same as the reference key.
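
As a rough illustration of that surmise, the Python sketch below (not DataStage code) hashes the stream on the lookup key alone and the reference on the lookup key plus an extra column. The CRC32 hash is a stand-in for the engine's real function, and the `region` column, the key values, and the partition count are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4  # assumed node count, for illustration only


def partition_of(key_values, num_partitions=NUM_PARTITIONS):
    # Stand-in hash over the concatenated key columns (not the engine's hash).
    joined = "|".join(key_values)
    return zlib.crc32(joined.encode("utf-8")) % num_partitions


def lookup_misses(stream, reference, stream_keys, ref_keys):
    # Partition the reference rows on ref_keys, remembering the lookup key "id".
    ref_parts = [set() for _ in range(NUM_PARTITIONS)]
    for row in reference:
        p = partition_of([row[k] for k in ref_keys])
        ref_parts[p].add(row["id"])
    # Partition the stream rows on stream_keys; look up only within that partition.
    misses = 0
    for row in stream:
        p = partition_of([row[k] for k in stream_keys])
        if row["id"] not in ref_parts[p]:
            misses += 1
    return misses


# Hypothetical rows; the same set is used for both links so every key has a match.
rows = [
    {"id": f"1-{6108676 + i}", "region": "EU" if i % 2 else "US"}
    for i in range(1000)
]

same_spec = lookup_misses(rows, rows, ["id"], ["id"])
diff_spec = lookup_misses(rows, rows, ["id"], ["id", "region"])

print("misses, identical hash keys:", same_spec)
print("misses, differing hash keys:", diff_spec)
```

With identical hash keys there are no misses. Once the two links hash on different column lists, rows with equal lookup keys can land in different partitions, and the lookup then fails for them even though the matching reference row exists.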