Hash partitioning and sorting
Posted: Wed Nov 09, 2011 10:58 am
Following up on
viewtopic.php?t=142981
with a different focus.
So my understanding of hash partitioning and sorting usage and best practice so far is:
EDITED, 5th edition:
Best practice & goal: choose as few partitioning keys as possible, as long as rows are still distributed evenly across partitions by the values in the key columns
Benefits of this practice (i.e., choosing only a minimal set of keys as the hash key(s)):
- easier to optimize partitioning for the entire job flow
- helps minimize the number of repartitions within and across job flows
Rules:
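To make the "distributed evenly" condition above a bit more concrete, here is a minimal Python sketch. It is not DataStage code: the sample rows, the column names (`region`, `cust_id`, `order_id`) and the helper functions are made up for illustration, and `zlib.crc32` only stands in for whatever hash function the engine really uses. It counts how many rows land in each partition for a candidate hash key.

```python
import zlib
from collections import Counter

def partition_of(row, hash_keys, num_partitions):
    """Map a row to a partition using only the chosen hash key columns."""
    key = "|".join(str(row[k]) for k in hash_keys)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_counts(rows, hash_keys, num_partitions):
    """Count rows per partition for a candidate hash key."""
    counts = Counter(partition_of(r, hash_keys, num_partitions) for r in rows)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical sample: 1000 rows, 100 customers, only 2 regions.
rows = [{"region": i % 2, "cust_id": i % 100, "order_id": i} for i in range(1000)]

# Only 2 distinct values, so at most 2 of the 4 partitions receive rows.
by_region = partition_counts(rows, ["region"], 4)

# 100 distinct values, so the rows have a chance to use all 4 partitions.
by_cust = partition_counts(rows, ["cust_id"], 4)

print("region :", by_region)
print("cust_id:", by_cust)
```

A single column is enough as the hash key when it has many distinct values relative to the number of partitions; a low-cardinality column like the hypothetical `region` leaves partitions empty no matter how few keys are used.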
- Generally speaking, hash partitioning is required when a stage groups related values (e.g. the Aggregator stage)
- Hash partitioning is required for all stages that match key values
(e.g. Join, Merge, Remove Duplicates, etc.)
- Hash keys have nothing to do with the fields that are marked as "keys"
- The hash keys can be just a subset of the matched/grouping keys, but
- The hash key must be at least one of the matched/grouping keys
- All streams that require matched key values must be hash partitioned on the same hash key
(e.g. Join, Merge, etc.)
- All the rest of the matched keys should be sorted
- The grouping keys for Aggregators do not need to be sorted
- Can the hash key be outside of the matched keys? I guess not.
- Does the order of the sorting keys matter, as long as the orders match each other?
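To check the subset rule and the "outside of the matched keys" question for myself, I put together a small Python sketch. Again, this is not DataStage code: the two input links, the column names (`cust_id`, `order_date`, `amount`) and the helpers are hypothetical, and `zlib.crc32` is only a stand-in hash. It records, for every full match-key value, which partitions its rows end up in.

```python
import zlib
from collections import defaultdict

def partition_of(row, hash_keys, num_partitions):
    """Map a row to a partition using only the chosen hash key columns."""
    key = "|".join(str(row[k]) for k in hash_keys)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partitions_per_match_key(rows, match_keys, hash_keys, num_partitions):
    """For each full match-key value, collect the partitions its rows land in."""
    seen = defaultdict(set)
    for row in rows:
        full_key = tuple(row[k] for k in match_keys)
        seen[full_key].add(partition_of(row, hash_keys, num_partitions))
    return seen

# Hypothetical rows from two input links of a join on (cust_id, order_date).
left = [{"cust_id": c, "order_date": d, "amount": c * 10 + d}
        for c in range(20) for d in range(5)]
right = [{"cust_id": c, "order_date": d, "amount": c + d * 7}
         for c in range(20) for d in range(5)]
match_keys = ["cust_id", "order_date"]

# Hash on a subset of the match keys, the same way on both links:
# rows that agree on the full key also agree on cust_id, so they share a partition.
subset = partitions_per_match_key(left + right, match_keys, ["cust_id"], 4)
colocated = all(len(parts) == 1 for parts in subset.values())

# Hash on a column outside the match keys:
# rows with the same full key can carry different amounts and get split apart.
outside = partitions_per_match_key(left + right, match_keys, ["amount"], 4)
split_keys = [k for k, parts in outside.items() if len(parts) > 1]

print("subset hash key keeps matches together:", colocated)
print("match keys split by an outside hash key:", len(split_keys))
```

In this sketch, hashing on `cust_id` alone always keeps matching rows together, while hashing on `amount` gives no such guarantee, which is consistent with the "guess not" above.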
Thanks