In one of my jobs I am doing a lookup against a large dataset.
Both inputs (reference and main) are hash partitioned on the same keys.
It has been working fine so far, but I wanted to know whether this is a risk.
I know that Entire partitioning is recommended for the reference link, but with our dataset being very large it does not seem to be a good option.
Please advise.
Rakesh
Partitioning in Lookup Stage
You are 100% correct to use the Hash partitioning algorithm on the same keys on both stream and reference inputs. It is totally risk free if done correctly.
On a single machine, Entire on the reference input isn't much of an overhead because a single copy of the reference data set is put into shared memory.
However, in a multiple-machine configuration Entire involves sending all rows of the reference data set to all nodes defined in the configuration file, via TCP/IP, which might be a substantial overhead.
Therefore it is particularly the case for multi-machine configurations that you should use identical key-based partitioning (hash or modulus) on the same keys on both stream and reference inputs.
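To make the reasoning concrete, here is a minimal Python sketch of the idea, not DataStage code. It assumes a stand-in hash (CRC32) rather than whatever algorithm the engine really uses, and the names `partition_of`, `hash_partition`, `partitioned_lookup` and `rows_sent_entire` are made up for illustration. The point is only that when both inputs go through the same function on the same key, matching rows land in the same partition, so each partition can do its lookup locally, whereas Entire would copy every reference row to every node.

```python
import zlib

NUM_PARTITIONS = 4  # assumed node count, for illustration only


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition number with a deterministic hash.

    CRC32 is a stand-in here; it is NOT the hash the parallel engine uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(rows, key_field, num_partitions=NUM_PARTITIONS):
    """Split rows into partitions according to the hash of key_field."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key_field], num_partitions)].append(row)
    return parts


def partitioned_lookup(stream_rows, reference_rows, key_field,
                       num_partitions=NUM_PARTITIONS):
    """Look up each stream row using only the reference rows in its own partition."""
    stream_parts = hash_partition(stream_rows, key_field, num_partitions)
    ref_parts = hash_partition(reference_rows, key_field, num_partitions)
    matched = []
    for s_part, r_part in zip(stream_parts, ref_parts):
        # Each partition builds its own small lookup table.
        ref_index = {r[key_field]: r for r in r_part}
        for s in s_part:
            ref = ref_index.get(s[key_field])
            if ref is not None:
                matched.append({**ref, **s})
    return matched


def rows_sent_entire(reference_row_count: int, node_count: int) -> int:
    """Rows delivered in total if every node receives the whole reference set."""
    return reference_row_count * node_count
```

With identical partitioning on both inputs every stream row finds its reference row without any partition seeing the full reference set, while `rows_sent_entire` shows how the Entire volume grows with the number of nodes.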
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
strange problem encountered in the same context
I have a Lookup stage in which I hash partition the main and reference link data on a certain key on the inputs of the Lookup stage. To my surprise the lookup failed: the key values, despite being the same on the main and reference links, were in different partitions. After modifying the setup a bit to partition the data ahead of a Peek stage, I found that the value '1-6108676' coming from the main link was put into partition 0 and the same value from the reference link was put into partition 1. Because of this the lookup was not finding a match.
Another thing to note is that when I do an internal sort and partition on the inputs of the Lookup stage for both the main and reference links, the lookup succeeds.
Can anyone tell me why this is happening?
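One possibility worth checking, and it is only a guess that nothing in this thread confirms, is that the key column is not byte-for-byte identical on the two links, for example a fixed-length field padded with spaces on one side and a variable-length field on the other. A hash partitioner works on the underlying bytes, so values that look the same in a Peek can still be sent to different partitions. The Python sketch below illustrates that general effect with a stand-in hash (CRC32, not the engine's own algorithm); `partition_of` and `normalize_key` are hypothetical helpers, and the padded width of 12 is an arbitrary assumption.

```python
import zlib


def partition_of(raw_key: bytes, num_partitions: int) -> int:
    """Partition number from the raw bytes of the key (CRC32 as a stand-in hash)."""
    return zlib.crc32(raw_key) % num_partitions


def normalize_key(raw_key: bytes) -> bytes:
    """Strip trailing pad spaces so both links present identical bytes."""
    return raw_key.rstrip(b" ")


# Same logical value, different physical representation (assumed for illustration).
main_key = b"1-6108676"                 # variable-length style
reference_key = b"1-6108676".ljust(12)  # fixed-length style, space padded

for label, key in (("main", main_key), ("reference", reference_key)):
    print(label, repr(key),
          "raw partition:", partition_of(key, 2),
          "normalized partition:", partition_of(normalize_key(key), 2))
```

If the raw partition numbers disagree while the normalized ones agree, the fix is to make the key's type, length and padding identical on both links before partitioning.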