Hash partitioning and sorting

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Hash partitioning and sorting

Post by fmou »

Following up on
viewtopic.php?t=142981
with a different focus.

So my understanding of the hash partitioning and sorting usage and best practice so far is,

EDITED, 5th edition:

Best practice & goal: choose as minimum keys to partition as possible as long as rows can be distributed evenly according to the values in key columns

Benefit of such practice (ie, choosing only minimum of the keys as the hash key(s)),

- more easier to optimize partitioning for the entire job flow
- help minimize the number of repartitions within and across job flows

rules:
  • - Generally speaking, Hash partitioning is required when stage requires grouping of related values (e.g. Aggregator stage)
    - Hash partitioning is required for all stages that require matched key values
    (e.g. Join, Merge, Remove Duplicates, etc)
    - Hash keys has nothing to do with the fields that are marked as "keys"
    - Hash keys can be only a subset of the matched/grouping keys, but
    - Hash key must be at least one of the matched/grouping keys
    - Hash partitioning is required to have the same has key for all streams that require matched key values
    (e.g. Join, Merge, etc)
    - All the rest of the matched keys should be sorted
    - The grouping keys for aggregators do not need not be sorted
Previous questions,
  • - Can hash key be out side of matched keys? Guess not.
    - Does the order of the sorting keys matter, as long as the order match each others?
Comment?

Thanks
Last edited by fmou on Sat Nov 12, 2011 11:14 am, edited 5 times in total.
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

As long as all the data that you need to aggregate or join ends up in the same partition, there's no strict rule that it must be one of the key columns, although that would be a little unusual. For instance, if you are joining or aggregating by House Number and Street Name, you could probably partition by Postcode. Maybe not though. It's hard to think of an example. And no, which of the composite key values you partiton by is not relevant either, if you're sorting or joining or aggregating on age and gender, then you can partition using either, but age is clearly the superior choice due to the greater diversity of values.
Phil Hibbs | Capgemini
Technical Consultant
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If you want to aggregate based on ZipCode and Sate, Its not right to partition based on ZipCode,Stage and City.
Unless you planned to have one extra round of Aggregation.


"- Does the order of the sorting keys matter, as long as the order match each others?" You are good if both links matches the order.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

there's no strict rule that it must be one of the key columns
hmm..., I was thinking of the fully normalized BNF3 tables, in which no any sub group of the fields in the table is functionally dependent on any fields other than the candidate keys.
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

hmm.... hold on, things now get complicated.

Say I need to join two streams (A & B) with the same join keys student_id and course_id.

stream A is hashed on student_id, and
stream B is hashed on course_id,

But when coming into the join stage, they all both sort on student_id first and then course_id.

According to my above theory, it should work. But is it really?

Moreover, if they both hashed on a single key but not both, when doing Parallel Sort (ie, not global sort), how can I make sure that the same student_id and course_id combination will end up in the same partition?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can't. In this scenario there may be students doing some courses but not others, and courses in which some students but not others are enrolled. Because the sources are different (student-centric and course-centric) you're probably safest to partition on both join keys.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

fmou wrote:stream A is hashed on student_id, and
stream B is hashed on course_id,

But when coming into the join stage, they all both sort on student_id first and then course_id.

According to my above theory, it should work. But is it really?
No. Sttream A Student X may be in Partiton 1, and Stream B Course Y may be in Partition 2, so there's no way to join Student X to Course Y because they are in different partitions. You would need to repartition one of the sets on the other key.
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

Perhaps even more than "will it work" is, "will it be understandable". If you explicitly partition, then I recommend that you annotate your job with partitioning information. So your join should have a note "Requires Link_StreamA to already be partitioned on Student_id", and it will be obvious from the presence of a repartitioning icon on the link that the Link_StreamB is being repartitioned to match. This is especially true if you are writing a Data Set with a particular kind of partitioning that is relied on in a subsequent job.
Phil Hibbs | Capgemini
Technical Consultant
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

thanks everyone for the replies.

I'll amend my best practice, adding the following:

- Hash partitioning is required to have the same has key for all streams that require matched key values
(e.g. Join, Merge, etc)
Post Reply