
Relationship between Nodes and Partitions...

Posted: Sun Nov 18, 2018 11:37 pm
by riyazahmed.d
https://www.ibm.com/support/knowledgece ... ioner.html

Question-->

Does the Hash partitioner depend on the number of nodes defined in the configuration file? As per the example given on the IBM page, it seems there is no relation between the number of nodes and partitions, and a single node can have many partitions.

If possible, please explain using the example given on the IBM page (URL above).

Posted: Mon Nov 19, 2018 1:46 am
by ray.wurlod
Welcome aboard.

Let's consider first the Modulus algorithm (applicable only to integers). The integer value is divided by the number of nodes and the remainder (obtained through the Mod() function) is the node number to which that particular value will be directed.

The only difference for the Hash algorithm is that it is applicable to any data type. All of the characters in the value are processed via an algorithm (something like adding all the ASCII values, though a little more complex than that) to produce a large integer value called the hash value. This is divided by the number of nodes and, as with the Modulus algorithm, the remainder yields the node number to which that particular value will be directed.

Node numbers, like most values in the DataStage parallel engine, start from 0.
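
A rough illustration of the idea (a sketch only -- the summed character codes and the 4-node count are assumptions, not the actual DataStage hashing function):

NUM_NODES = 4  # number of nodes in the configuration file (assumed)

def modulus_partition(value, num_nodes=NUM_NODES):
    # Modulus partitioning: integer key mod number of nodes.
    return value % num_nodes

def hash_partition(value, num_nodes=NUM_NODES):
    # Hash partitioning: reduce any value to a large integer, then mod.
    hash_value = sum(ord(ch) for ch in str(value))  # simplified stand-in
    return hash_value % num_nodes

print(modulus_partition(1234))       # 1234 mod 4 = 2 -> directed to node 2
print(hash_partition("90210-1234"))  # some node in the range 0..3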

Posted: Mon Nov 19, 2018 2:00 am
by riyazahmed.d
Thanks Ray for the detailed explanation.

Below is the paragraph from the IBM page on Hash Partitioner:

"When hash partitioning, you should select hashing keys that create a large number of partitions. For example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not a large number for a parallel processing system. Instead, you could hash by five digits of the zip code to create up to 10,000 partitions. You also could combine a zip code hash with an age hash (assuming a maximum age of 190), to yield 1,500,000 possible partitions. "

As per your explanation above, if we have 4 nodes we get remainders 0, 1, 2, 3, and the data will accordingly be divided among 4 nodes. But as per the IBM paragraph, we can have "n" partitions depending on the data in the hash key, and it says 1.5 million partitions are possible.

Could you please explain using the example given on the IBM page?

Posted: Mon Nov 19, 2018 11:11 am
by UCDI
If the data isn't suited to partitions "as is", run some of it through a Checksum stage and use that field as the partition key. It should give solid results.

Posted: Mon Nov 19, 2018 1:59 pm
by jneasy
If the data isn't suited to partitioning, what is the point of using a checksum for partitioning? For example, if you want to partition on, say, a gender code, which potentially has only two distinct values, and you checksum the gender code, you will still have only two distinct values produced from the checksum. The only difference is that you will have a much larger additional field.
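
To make that concrete (a sketch, using Python's CRC32 as a stand-in for whatever the Checksum stage actually computes):

import zlib

NUM_NODES = 4
rows = ["M", "F", "M", "M", "F", "F", "M"]  # only two distinct values

# A checksum of a low-cardinality column is still low-cardinality:
target_nodes = {zlib.crc32(g.encode()) % NUM_NODES for g in rows}
print(target_nodes)  # at most 2 distinct nodes, regardless of node count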

If all you are worried about is even data distribution across your nodes, and there is no requirement to rejoin, then why not use Round Robin?

Posted: Mon Nov 19, 2018 7:14 pm
by ray.wurlod
The IBM example was written by someone who doesn't fully understand what is happening or, perhaps, has a million partitions available.

In real life, your data can only be partitioned over the number of nodes available.
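
A quick way to see it, reusing the simplified hash sketch from my earlier post (again an assumption, not the real hashing function): however many distinct hash values the key can produce, the modulo by the node count collapses them onto the nodes you actually have.

NUM_NODES = 4

def hash_partition(value, num_nodes=NUM_NODES):
    hash_value = sum(ord(ch) for ch in str(value))  # simplified stand-in
    return hash_value % num_nodes

# 10,000 distinct five-digit zip codes, but only NUM_NODES actual partitions:
zips = [f"{z:05d}" for z in range(10000)]
print(len({hash_partition(z) for z in zips}))  # prints 4, not 10000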

Posted: Mon Nov 19, 2018 10:39 pm
by riyazahmed.d
Thanks everyone!!

It is pretty much clear now :)