Partitioning in Lookup Stage

r.bhatia · Post by **r.bhatia** » Fri Jan 16, 2009 1:26 am

In one of my jobs I am doing a lookup against a large dataset.

The partitioning used on both links (reference and main) is hash on the same keys.

It has been working fine so far, but I wanted to know whether this is a risk.

I know that Entire partitioning is recommended for the reference link, but because our dataset is very large, that does not seem to be a good option.

Please advise.

Rakesh

ray.wurlod · Post by **ray.wurlod** » Fri Jan 16, 2009 5:30 am

You are 100% correct to use the Hash partitioning algorithm on the same keys on both stream and reference inputs. It is totally risk free if done correctly.

On a single machine, Entire on the reference input isn't much of an overhead because a single copy of the reference data set is put into shared memory.

However, in a multiple-machine configuration, Entire involves sending all rows of the reference data set to all nodes defined in the configuration file via TCP/IP, which might be a substantial overhead.
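As a rough illustration of that overhead, here is a back-of-the-envelope sketch in Python. The row count and node count are made-up figures, and the function `rows_sent` is hypothetical; it only counts rows moved and ignores row width, local nodes and shared memory.

```python
def rows_sent(reference_rows, num_nodes, method):
    """Rough count of reference rows delivered to the nodes for one lookup."""
    if method == "entire":
        # Every node receives a full copy of the reference data
        return reference_rows * num_nodes
    if method == "hash":
        # Each row goes to exactly one node, chosen by its key
        return reference_rows
    raise ValueError(f"unknown method: {method}")

reference_rows = 20_000_000  # hypothetical reference data set size
num_nodes = 4                # hypothetical node count from the configuration file

entire_total = rows_sent(reference_rows, num_nodes, "entire")
hash_total = rows_sent(reference_rows, num_nodes, "hash")
print(entire_total, hash_total)
```

Under these assumptions Entire moves as many copies of the reference data as there are nodes, while key-based partitioning moves each row once.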

Therefore, particularly in multi-machine configurations, you should use identical key-based partitioning (hash or modulus) on the same keys on both the stream and reference inputs.
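To see why identical key-based partitioning is safe, a minimal Python sketch may help. It is not DataStage code: `hash_partition` and `partitioned_lookup` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in for whatever hash function the engine really uses. The point is that a row and its reference row with the same key value always land in the same partition, so each partition can do its lookup against its own slice of the reference data.

```python
import zlib

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition from a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is a deterministic stand-in, not the engine's real hash function
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

def partitioned_lookup(stream_rows, reference_rows, key, num_partitions):
    """Partition both inputs the same way, then look up partition by partition."""
    stream_parts = hash_partition(stream_rows, key, num_partitions)
    reference_parts = hash_partition(reference_rows, key, num_partitions)
    matched = []
    for stream_part, reference_part in zip(stream_parts, reference_parts):
        # Each partition only sees its own slice of the reference data
        table = {ref[key]: ref for ref in reference_part}
        for row in stream_part:
            if row[key] in table:
                matched.append({**row, **table[row[key]]})
    return matched

# Made-up sample data: every stream row has a matching reference row
stream = [{"id": i, "amount": i * 10} for i in range(100)]
reference = [{"id": i, "name": f"customer-{i}"} for i in range(100)]
result = partitioned_lookup(stream, reference, "id", 4)
print(len(result), "of", len(stream), "stream rows matched")
```

No partition ever needs the full reference data set, which is what makes this attractive when the reference data is large.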

satyam_ps · Post by **satyam_ps** » Fri Jan 16, 2009 11:02 am

I have a Lookup stage in which I hash partition the main and reference link data on a certain key on the inputs of the Lookup stage. To my surprise, the lookup failed because the values, despite being the same on the main and reference links, were in different partitions. After modifying the setup a bit, introducing a Peek stage, and partitioning the data before the Peek, I found that the value '1-6108676' coming from the main link was put into partition 0 and the one from the reference link was put into partition 1. Because of this, the lookup was not finding a match.

Another thing to note is that when I do an internal sort and partition within the inputs of the Lookup stage for both the main and reference links, the lookup succeeds.
Can anyone tell me why this is happening?

ray.wurlod · Post by **ray.wurlod** » Fri Jan 16, 2009 3:36 pm

I can only surmise that the hash keys were not identically specified and/or were not the same as the reference key.
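A small Python sketch of how that could happen, offered only as an illustration of the surmise and not as a diagnosis of the job above. The column names `acct` and `region`, the helper `partition_for`, and the use of `zlib.crc32` as the hash are all assumptions. When both links hash on the same key columns, a given value always gets the same partition; when one link hashes on an extra column, the same value can be sent elsewhere.

```python
import zlib

def partition_for(row, key_columns, num_partitions):
    """Pick a partition from a hash of the listed key columns, in order."""
    key_text = "|".join(str(row[column]) for column in key_columns)
    # crc32 is a deterministic stand-in, not the engine's real hash function
    return zlib.crc32(key_text.encode("utf-8")) % num_partitions

NUM_PARTITIONS = 4
# Made-up rows; "acct" plays the role of the lookup key
rows = [{"acct": f"1-{6108000 + i}", "region": "EU"} for i in range(1000)]

# Same key specification on both links: partitions always agree
same_spec_mismatches = sum(
    partition_for(r, ["acct"], NUM_PARTITIONS)
    != partition_for(r, ["acct"], NUM_PARTITIONS)
    for r in rows
)

# Stream hashed on acct only, reference hashed on acct plus region:
# the lookup key is identical, yet the partitions can differ
diff_spec_mismatches = sum(
    partition_for(r, ["acct"], NUM_PARTITIONS)
    != partition_for(r, ["acct", "region"], NUM_PARTITIONS)
    for r in rows
)

print(same_spec_mismatches, diff_spec_mismatches)
```

Any row counted in `diff_spec_mismatches` would fail a partition-local lookup even though its key value exists in the reference data.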


strange problem encountered in the same context