aggregator sort method issue

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

aggregator sort method issue

Post by KEvens »

Hi.

I have searched the forums but I couldn't find this exact issue.

We have a job that uses the Aggregator stage with 12 group-by elements. Initially the job ran with the aggregator 'hash' method, but it ran into heap problems similar to those mentioned in other threads in this forum. As suggested there, we changed the aggregator from the 'hash' method to the 'sort' method, still using the 12 group-by elements, but the amount and quantity totals are now approximately 44-46% higher. The aggregations use the 'sum' of the output columns with 'missing value = 0.0'. The grouping key sequence matches both the input sequence and the sort sequence.

The job has been tested both with and without a Sort stage prior to the aggregator. Other tests include single-node processing; the hash, hash/sort, same, and default partitioning options on both the previous stage and the aggregator; and setting preserve partitioning to 'clear' and to 'default'. As the options change, the amount and quantity totals also change, but they remain at least 44% higher than the actual totals. The Director shows no warnings for the aggregator.

I was able to strip the input data down to 216 records to test the summing of the amount and quantity. With both the 'hash' and 'sort' methods, 216 records go in and 216 records come out of the aggregator, but with the 'hash' method all the totals sum correctly, while with the 'sort' method they do not.

Any suggestions or ideas would be greatly appreciated!

Thank you in advance for your help and guidance.

Kim

PS: This is DataStage 7.5.1.A on AIX.
Krazykoolrohit
Charter Member
Posts: 560
Joined: Wed Jul 13, 2005 5:36 am
Location: Ohio

Post by Krazykoolrohit »

Are you sorting the data on those 12 columns before feeding it into the aggregator?

From what I know, the sort option in the aggregator assumes that the input data is already sorted on those columns.
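
As a rough analogy (plain Python, not DataStage/OSH; just a sketch of the two grouping strategies): sort-based grouping only merges adjacent rows with equal keys, so unsorted input silently splits groups, while hash-based grouping is order-independent.

import itertools
from collections import defaultdict

rows = [("A", 10), ("B", 5), ("A", 7)]  # note: NOT sorted on the key

# Sort-method analogue: groupby merges only consecutive equal keys,
# so the two "A" rows end up in separate groups.
sort_style = [(key, sum(v for _, v in grp))
              for key, grp in itertools.groupby(rows, key=lambda r: r[0])]
print(sort_style)   # [('A', 10), ('B', 5), ('A', 7)] -- group "A" is split

# Hash-method analogue: a dict accumulates regardless of input order.
hash_style = defaultdict(int)
for key, v in rows:
    hash_style[key] += v
print(dict(hash_style))   # {'A': 17, 'B': 5} -- correct totals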
KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

Hi.

Yes, when testing the sort aggregation, we put a sort stage prior to the aggregator, sorting on the same 12 group elements.

Thanks!
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai

Post by Minhajuddin »

Try using the sort option in the Aggregator without the Sort stage (the data still has to be sorted, but DataStage inserts a tsort operator that sorts it on the right columns). See if it makes any difference.
Minhajuddin

KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

Hi.

I also ran a test without a Sort stage, but with hash/sort selected on the input of the sort-method aggregator, based on the 12 group-by elements. The target totals were still 44% higher than the source totals.

Please continue to send suggestions as I will test them. We are thinking we may need to send this issue to IBM for their help, but I would like to exhaust all possibilities before doing so.

Thanks again for your help!
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai

Post by Minhajuddin »

Maybe you can post your OSH code; that way people will be able to analyze your problem better.
Minhajuddin

KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

Here is the OSH code. I hope it's not too much to post. This is a test job that reads a sequential input file into a Copy stage, which then sends the data down two paths: one path goes into a hash-method aggregator, then a transformer (xfm), then Oracle; the other goes into a transformer, then a sort-method aggregator, another transformer, and Oracle. I then compare the totals in the target tables.

Thank you for taking a look. Please let me know if you see anything.

#### STAGE: xfmChgDataTypes.ToFinalTrans_Part
## Operator
same
## General options
[ident('xfmChgDataTypes.ToFinalTrans_Part')]
## Inputs
0< 'aggrallcols:ToFinalTrans.v'
## Outputs
0> [] 'aggrallcols:ToFinalTrans_Part.v'
;

#################################################################
#### STAGE: xfmChgDataTypes
## Operator
transform
## Operator options
-flag run
-name 'V108S1_TestSeqMDWBUDGFdwPayrollBudgetFact_xfmChgDataTypes'
-argvalue 'PROC_AUDIT_ID=[&PROC_AUDIT_ID]'

## General options
[ident('xfmChgDataTypes'); jobmon_ident('xfmChgDataTypes')]
## Inputs
0< 'aggrallcols:ToFinalTrans_Part.v'
## Outputs
0> [] 'xfmChgDataTypes:out_FdwPayrollBudgetFact.v'
;

#### STAGE: aggrallcols.ToAggregate_Part
## Operator
hash -key CALENDAR_ID -key Co_Derived -key CostCenter -cs -key PRCS_LVL_Derived -cs -key MO_OR_PP_NBR -key ORGANIZATION_ID -key ACCOUNT_ID -key BUDGET_ID -key JOB_ID -key POSTN_PAY_LVL_ID -key PAYROLL_TYPE_ID
## General options
[ident('aggrallcols.ToAggregate_Part')]
## Inputs
0< 'Transformer_179:ToAggregate.v'
## Outputs
0> [] 'Transformer_179:ToAggregate_Part.v'
;

#### STAGE: aggrallcols.ToAggregate_Sort
## Operator
tsort -key CALENDAR_ID -asc -key Co_Derived -asc -key CostCenter -cs -asc -key PRCS_LVL_Derived -cs -asc -key MO_OR_PP_NBR -asc -key ORGANIZATION_ID -asc -key ACCOUNT_ID -asc -key BUDGET_ID -asc -key JOB_ID -asc -key POSTN_PAY_LVL_ID -asc -key PAYROLL_TYPE_ID -asc -nonStable
## General options
[ident('aggrallcols.ToAggregate_Sort')]
## Inputs
0< 'Transformer_179:ToAggregate_Part.v'
## Outputs
0> [] 'Transformer_179:ToAggregate_Part_Sort.v'
;

#################################################################
#### STAGE: aggrallcols
## Operator
group
## Operator options
-sort
-key 'CALENDAR_ID'
-key 'Co_Derived'
-key 'CostCenter'
-key 'PRCS_LVL_Derived'
-key 'MO_OR_PP_NBR'
-key 'ORGANIZATION_ID'
-key 'ACCOUNT_ID'
-key 'BUDGET_ID'
-key 'JOB_ID'
-key 'POSTN_PAY_LVL_ID'
-key 'PAYROLL_TYPE_ID'
-reduce 'Hours'
-sum 'Hours'
-mval 0.0
-reduce 'Dollars'
-sum 'Dollars'
-mval 0.0
-reduce 'change_code'
-min 'change_code'

## General options
[ident('aggrallcols'); jobmon_ident('aggrallcols')]
## Inputs
0< 'Transformer_179:ToAggregate_Part_Sort.v'
## Outputs
0> [modify (
change_code:not_nullable int8=change_code;
keep
CALENDAR_ID,Co_Derived,CostCenter,PRCS_LVL_Derived,
MO_OR_PP_NBR,Hours,Dollars,ORGANIZATION_ID,
ACCOUNT_ID,BUDGET_ID,JOB_ID,POSTN_PAY_LVL_ID,
PAYROLL_TYPE_ID;
)] 'aggrallcols:ToFinalTrans.v'
;

#################################################################
#### STAGE: rej_FdwPayrollBudgetFact
## Operator
export
## Operator options
-schema record
{final_delim=end, delim=',', null_field='', quote=double}
(
CALENDAR_ID:decimal[20,0];
ORGANIZATION_ID:decimal[20,0];
JOB_ID:decimal[20,0];
POSTN_PAY_LVL_ID:decimal[20,0];
PAYROLL_TYPE_ID:decimal[20,0];
ACCOUNT_ID:decimal[20,0];
BUDGET_ID:decimal[20,0];
ACTIVE_RCRD_FLG:nullable string[1];
BDGT_AMT:nullable decimal[22,6];
BDGT_HRS_QTY:nullable decimal[22,6];
INSERT_PRCS_AUD_ID:nullable decimal[20,0];
LST_UPDT_PRCS_AUD_ID:nullable decimal[20,0];
SSU:nullable int32;
PRCS_LVL:nullable string[max=5];
DEPT:nullable string[max=5];
sqlcode:int32;
)
-file '[&_BASE_PATH]/budg/dbreject/rej_fdw_payroll_budget_fact.seq'
-overwrite
-rejects continue

## General options
[ident('rej_FdwPayrollBudgetFact'); jobmon_ident('rej_FdwPayrollBudgetFact')]
## Inputs
0< 'oraFdwPayrollBudgetFact:rej_FdwPayrollBudgetFact.v'
;

#### STAGE: oraFdwPayrollBudgetFact.out_FdwPayrollBudgetFact_Part
## Operator
same
## General options
[ident('oraFdwPayrollBudgetFact.out_FdwPayrollBudgetFact_Part')]
## Inputs
0< 'xfmChgDataTypes:out_FdwPayrollBudgetFact.v'
## Outputs
0> [] 'xfmChgDataTypes:out_FdwPayrollBudgetFact_Part.v'
;

#################################################################
#### STAGE: oraFdwPayrollBudgetFact
## Operator
oraupsert
## Operator options
-dboptions '{user=[&_MDW_ETL_USER],password=[&_MDW_ETL_PASSWORD]}'
-insert 'INSERT
INTO
[&_MDW_SCHEMA].[&pSnapshotTable]
(CALENDAR_ID, ORGANIZATION_ID, JOB_ID, POSTN_PAY_LVL_ID, PAYROLL_TYPE_ID, ACCOUNT_ID, BUDGET_ID, ACTIVE_RCRD_FLG, BDGT_AMT, BDGT_HRS_QTY, INSERT_PRCS_AUD_ID, LST_UPDT_PRCS_AUD_ID, SSU, PRCS_LVL, DEPT)
VALUES
(ORCHESTRATE.CALENDAR_ID, ORCHESTRATE.ORGANIZATION_ID, ORCHESTRATE.JOB_ID, ORCHESTRATE.POSTN_PAY_LVL_ID, ORCHESTRATE.PAYROLL_TYPE_ID, ORCHESTRATE.ACCOUNT_ID, ORCHESTRATE.BUDGET_ID, ORCHESTRATE.ACTIVE_RCRD_FLG, ORCHESTRATE.BDGT_AMT, ORCHESTRATE.BDGT_HRS_QTY, ORCHESTRATE.INSERT_PRCS_AUD_ID, ORCHESTRATE.LST_UPDT_PRCS_AUD_ID, ORCHESTRATE.SSU, ORCHESTRATE.PRCS_LVL, ORCHESTRATE.DEPT)'
-update 'UPDATE
[&_MDW_SCHEMA].[&pSnapshotTable]
SET
ACTIVE_RCRD_FLG = ORCHESTRATE.ACTIVE_RCRD_FLG, BDGT_AMT = ORCHESTRATE.BDGT_AMT, BDGT_HRS_QTY = ORCHESTRATE.BDGT_HRS_QTY, LST_UPDT_PRCS_AUD_ID = ORCHESTRATE.LST_UPDT_PRCS_AUD_ID, SSU = ORCHESTRATE.SSU, PRCS_LVL = ORCHESTRATE.PRCS_LVL, DEPT = ORCHESTRATE.DEPT
WHERE
(CALENDAR_ID = ORCHESTRATE.CALENDAR_ID AND ORGANIZATION_ID = ORCHESTRATE.ORGANIZATION_ID AND JOB_ID = ORCHESTRATE.JOB_ID AND POSTN_PAY_LVL_ID = ORCHESTRATE.POSTN_PAY_LVL_ID AND PAYROLL_TYPE_ID = ORCHESTRATE.PAYROLL_TYPE_ID AND ACCOUNT_ID = ORCHESTRATE.ACCOUNT_ID AND BUDGET_ID = ORCHESTRATE.BUDGET_ID)'
-update_first
-reject
-server '[&_MDW_ETL_DSN]'

## General options
[ident('oraFdwPayrollBudgetFact'); jobmon_ident('oraFdwPayrollBudgetFact')]
## Inputs
0< 'xfmChgDataTypes:out_FdwPayrollBudgetFact_Part.v'
## Outputs
0> [] 'oraFdwPayrollBudgetFact:rej_FdwPayrollBudgetFact.v'
;

#################################################################
#### STAGE: seqTestInputFile
## Operator
import
## Operator options
-schema record
{final_delim=end, delim=',', null_field='', quote=double}
(
CALENDAR_ID:decimal[20,0];
Co_Derived:int32;
CostCenter:string[max=15] {quote=none};
PRCS_LVL_Derived:string[max=5];
MO_OR_PP_NBR:int32;
ORGANIZATION_ID:decimal[20,0];
ACCOUNT_ID:decimal[20,0];
BUDGET_ID:decimal[20,0];
JOB_ID:decimal[20,0];
POSTN_PAY_LVL_ID:decimal[20,0];
PAYROLL_TYPE_ID:decimal[20,0];
Hours:decimal[19,6] {quote=none};
Dollars:decimal[22,6] {quote=none};
change_code:int8;
)
-file '[&_BASE_PATH]/budget/work/inputTestFile.csv'
-rejects continue
-reportProgress yes

## General options
[ident('seqTestInputFile'); jobmon_ident('seqTestInputFile')]
## Outputs
0> [] 'seqTestInputFile:outSort.v'
;

#################################################################
#### STAGE: Copy_173
## Operator
copy
## General options
[ident('Copy_173'); jobmon_ident('Copy_173')]
## Inputs
0< 'seqTestInputFile:outSort.v'
## Outputs
0> [modify (
keep
CALENDAR_ID,Co_Derived,CostCenter,PRCS_LVL_Derived,
MO_OR_PP_NBR,ORGANIZATION_ID,ACCOUNT_ID,BUDGET_ID,
JOB_ID,POSTN_PAY_LVL_ID,PAYROLL_TYPE_ID,Hours,
Dollars,change_code;
)] 'Copy_173:DSLink174.v'
1> [modify (
keep
CALENDAR_ID,Co_Derived,CostCenter,PRCS_LVL_Derived,
MO_OR_PP_NBR,ORGANIZATION_ID,ACCOUNT_ID,BUDGET_ID,
JOB_ID,POSTN_PAY_LVL_ID,PAYROLL_TYPE_ID,Hours,
Dollars,change_code;
)] 'Copy_173:ToAggregate2.v'
;

#### STAGE: Copy_of_aggrallcols.ToAggregate2_Part
## Operator
hash -key CALENDAR_ID -key Co_Derived -key CostCenter -cs -key PRCS_LVL_Derived -cs -key MO_OR_PP_NBR -key ORGANIZATION_ID -key ACCOUNT_ID -key BUDGET_ID -key JOB_ID -key POSTN_PAY_LVL_ID -key PAYROLL_TYPE_ID
## General options
[ident('Copy_of_aggrallcols.ToAggregate2_Part')]
## Inputs
0< 'Copy_173:ToAggregate2.v'
## Outputs
0> [] 'Copy_173:ToAggregate2_Part.v'
;

#################################################################
#### STAGE: Copy_of_aggrallcols
## Operator
group
## Operator options
-hash
-key 'CALENDAR_ID'
-key 'Co_Derived'
-key 'CostCenter'
-key 'PRCS_LVL_Derived'
-key 'MO_OR_PP_NBR'
-key 'ORGANIZATION_ID'
-key 'ACCOUNT_ID'
-key 'BUDGET_ID'
-key 'JOB_ID'
-key 'POSTN_PAY_LVL_ID'
-key 'PAYROLL_TYPE_ID'
-reduce 'Hours'
-sum 'Hours'
-mval 0.0
-reduce 'Dollars'
-sum 'Dollars'
-mval 0.0
-reduce 'change_code'
-min 'change_code'

## General options
[ident('Copy_of_aggrallcols'); jobmon_ident('Copy_of_aggrallcols')]
## Inputs
0< 'Copy_173:ToAggregate2_Part.v'
## Outputs
0> [modify (
change_code:not_nullable int8=change_code;
keep
CALENDAR_ID,Co_Derived,CostCenter,PRCS_LVL_Derived,
MO_OR_PP_NBR,Hours,Dollars,ORGANIZATION_ID,
ACCOUNT_ID,BUDGET_ID,JOB_ID,POSTN_PAY_LVL_ID,
PAYROLL_TYPE_ID;
)] 'Copy_of_aggrallcols:ToFinalTrans3.v'
;

#### STAGE: Copy_of_oraFdwPayrollBudgetFact.out_FdwPayrollBudgetFact3_Part
## Operator
same
## General options
[ident('Copy_of_oraFdwPayrollBudgetFact.out_FdwPayrollBudgetFact3_Part')]
## Inputs
0< 'Copy_of_xfmChgDataTypes:out_FdwPayrollBudgetFact3.v'
## Outputs
0> [] 'Copy_of_xfmChgDataTypes:out_FdwPayrollBudgetFact3_Part.v'
;

#################################################################
#### STAGE: Copy_of_oraFdwPayrollBudgetFact
## Operator
oraupsert
## Operator options
-dboptions '{user=[&_MDW_ETL_USER],password=[&_MDW_ETL_PASSWORD]}'
-insert 'INSERT
INTO
[&_MDW_SCHEMA].FDW_PAYROLL_BUDGET_FACT3
(CALENDAR_ID, ORGANIZATION_ID, JOB_ID, POSTN_PAY_LVL_ID, PAYROLL_TYPE_ID, ACCOUNT_ID, BUDGET_ID, ACTIVE_RCRD_FLG, BDGT_AMT, BDGT_HRS_QTY, INSERT_PRCS_AUD_ID, LST_UPDT_PRCS_AUD_ID, SSU, PRCS_LVL, DEPT)
VALUES
(ORCHESTRATE.CALENDAR_ID, ORCHESTRATE.ORGANIZATION_ID, ORCHESTRATE.JOB_ID, ORCHESTRATE.POSTN_PAY_LVL_ID, ORCHESTRATE.PAYROLL_TYPE_ID, ORCHESTRATE.ACCOUNT_ID, ORCHESTRATE.BUDGET_ID, ORCHESTRATE.ACTIVE_RCRD_FLG, ORCHESTRATE.BDGT_AMT, ORCHESTRATE.BDGT_HRS_QTY, ORCHESTRATE.INSERT_PRCS_AUD_ID, ORCHESTRATE.LST_UPDT_PRCS_AUD_ID, ORCHESTRATE.SSU, ORCHESTRATE.PRCS_LVL, ORCHESTRATE.DEPT)'
-update 'UPDATE
[&_MDW_SCHEMA].FDW_PAYROLL_BUDGET_FACT3
SET
ACTIVE_RCRD_FLG = ORCHESTRATE.ACTIVE_RCRD_FLG, BDGT_AMT = ORCHESTRATE.BDGT_AMT, BDGT_HRS_QTY = ORCHESTRATE.BDGT_HRS_QTY, LST_UPDT_PRCS_AUD_ID = ORCHESTRATE.LST_UPDT_PRCS_AUD_ID, SSU = ORCHESTRATE.SSU, PRCS_LVL = ORCHESTRATE.PRCS_LVL, DEPT = ORCHESTRATE.DEPT
WHERE
(CALENDAR_ID = ORCHESTRATE.CALENDAR_ID AND ORGANIZATION_ID = ORCHESTRATE.ORGANIZATION_ID AND JOB_ID = ORCHESTRATE.JOB_ID AND POSTN_PAY_LVL_ID = ORCHESTRATE.POSTN_PAY_LVL_ID AND PAYROLL_TYPE_ID = ORCHESTRATE.PAYROLL_TYPE_ID AND ACCOUNT_ID = ORCHESTRATE.ACCOUNT_ID AND BUDGET_ID = ORCHESTRATE.BUDGET_ID)'
-update_first
-reject
-server '[&_MDW_ETL_DSN]'

## General options
[ident('Copy_of_oraFdwPayrollBudgetFact'); jobmon_ident('Copy_of_oraFdwPayrollBudgetFact')]
## Inputs
0< 'Copy_of_xfmChgDataTypes:out_FdwPayrollBudgetFact3_Part.v'
## Outputs
0> [] 'Copy_of_oraFdwPayrollBudgetFact:rej_FdwPayrollBudgetFact.v'
;

#################################################################
#### STAGE: Copy_of_rej_FdwPayrollBudgetFact
## Operator
export
## Operator options
-schema record
{final_delim=end, delim=',', null_field='', quote=double}
(
CALENDAR_ID:decimal[20,0];
ORGANIZATION_ID:decimal[20,0];
JOB_ID:decimal[20,0];
POSTN_PAY_LVL_ID:decimal[20,0];
PAYROLL_TYPE_ID:decimal[20,0];
ACCOUNT_ID:decimal[20,0];
BUDGET_ID:decimal[20,0];
ACTIVE_RCRD_FLG:nullable string[1];
BDGT_AMT:nullable decimal[22,6];
BDGT_HRS_QTY:nullable decimal[22,6];
INSERT_PRCS_AUD_ID:nullable decimal[20,0];
LST_UPDT_PRCS_AUD_ID:nullable decimal[20,0];
SSU:nullable int32;
PRCS_LVL:nullable string[max=5];
DEPT:nullable string[max=5];
sqlcode:int32;
)
-file '[&_BASE_PATH]/budg/dbreject/rej_fdw_payroll_budget_fact.seq'
-overwrite
-rejects continue

## General options
[ident('Copy_of_rej_FdwPayrollBudgetFact'); jobmon_ident('Copy_of_rej_FdwPayrollBudgetFact')]
## Inputs
0< 'Copy_of_oraFdwPayrollBudgetFact:rej_FdwPayrollBudgetFact.v'
;

#### STAGE: Copy_of_xfmChgDataTypes.ToFinalTrans3_Part
## Operator
same
## General options
[ident('Copy_of_xfmChgDataTypes.ToFinalTrans3_Part')]
## Inputs
0< 'Copy_of_aggrallcols:ToFinalTrans3.v'
## Outputs
0> [] 'Copy_of_aggrallcols:ToFinalTrans3_Part.v'
;

#################################################################
#### STAGE: Copy_of_xfmChgDataTypes
## Operator
transform
## Operator options
-flag run
-name 'V115S5_TestSeqMDWBUDGFdwPayrollBudgetFact_Copy_of_xfmChgDataTypes'
-argvalue 'PROC_AUDIT_ID=[&PROC_AUDIT_ID]'

## General options
[ident('Copy_of_xfmChgDataTypes'); jobmon_ident('Copy_of_xfmChgDataTypes')]
## Inputs
0< 'Copy_of_aggrallcols:ToFinalTrans3_Part.v'
## Outputs
0> [] 'Copy_of_xfmChgDataTypes:out_FdwPayrollBudgetFact3.v'
;

#################################################################
#### STAGE: Transformer_179
## Operator
transform
## Operator options
-flag run
-name 'V0S179_TestSeqMDWBUDGFdwPayrollBudgetFact_Transformer_179'

## General options
[ident('Transformer_179'); jobmon_ident('Transformer_179')]
## Inputs
0< 'Copy_173:DSLink174.v'
## Outputs
0> [] 'Transformer_179:ToAggregate.v'
;
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

Hi Kim,

Have you made any progress with this issue yet?

I am seeing a similar problem on 7.5.1.A on AIX too, and am convinced that the aggregator has issues trying to sum values of type 'double'.

I tried to prove/disprove this by converting to decimal; however, I still see my summed values changing on each run. Sorting and hashing on the relevant keys is definitely implemented correctly.
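
For what it's worth, here is a tiny plain-Python illustration (not DataStage code) of why sums of doubles can legitimately drift between runs: double-precision addition is not associative, so the order in which a parallel job happens to combine the rows changes the rounding.

vals_a = [1.0, 1.0, 1e16, -1e16]   # one combination order
vals_b = [1e16, 1.0, -1e16, 1.0]   # same four numbers, different order

print(sum(vals_a))   # 2.0
print(sum(vals_b))   # 1.0 -- a 1.0 added to 1e16 is lost to rounding

That would only account for small per-run drift, though, not anything like a 44% inflation.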

EDIT: Our aggregation method is hash as opposed to sort

Cheers,

Mark
Mark Winter
Nothing appeases a troubled mind more than good music
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Grouping by 12 columns? That is very highly likely to generate a complete set of unique records. Do your 216 records have any duplicates based on these 12 fields?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

miwinter,

We were unable to use this specific logic going forward, and we are still planning to send it to IBM to have them look into this aggregator 'sort' method issue.

We have worked around this issue by moving the pivot that we had prior to the aggregator to after the aggregator and changing the aggregator back to the 'hash' method.

In the transformer prior to the hash-method aggregator, we have preserve partitioning set to 'clear'. In the aggregator stage properties, the method is set to 'hash', the partition type on the input Partitioning tab is set to hash, and the selected hash keys match the order of the grouping keys in the stage properties.
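
For anyone following along, here is a minimal plain-Python sketch (hypothetical two-node run, not DataStage code) of why the partition keys have to line up with the grouping keys: if rows of one group land on different partitions, each partition emits its own partial result for that group.

from collections import defaultdict

rows = [("A", 10), ("A", 7), ("B", 5)]

def aggregate(partition):
    # hash-method aggregation within one partition
    totals = defaultdict(int)
    for key, v in partition:
        totals[key] += v
    return sorted(totals.items())

# Partitioned by hash of the grouping key: each group stays on one node.
by_key = [[r for r in rows if hash(r[0]) % 2 == n] for n in range(2)]
print([aggregate(p) for p in by_key])     # group "A" totals 17 on one node

# Round-robin partitioning: group "A" is split across nodes,
# so two partial "A" rows come out instead of one total.
round_robin = [rows[0::2], rows[1::2]]
print([aggregate(p) for p in round_robin])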

I hope this helps.


ray.wurlod

With the 216 test records, there were no duplicates.

Thank you for any and all of your help!
Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That's why you got 216 rows out.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

Ray,

Yes, during the scaled-down data test we had 216 records come out, but the sums of the QTY and AMT fields differed between the 'sort' and 'hash' methods on the aggregator.

The sums are correct when using the 'hash' method, but almost 46% higher with the 'sort' method.

When using the full-size data, there were significant duplicates, but with the 'hash' method we ran into the heap problem. Looking through DSXchange and various other publications, we found the advice that if the 'hash' method runs into heap problems, you should switch to the 'sort' method. The 'sort' method did fix the heap problem, but the sums were not correct.

For example, here are the numbers after running the same 216 records through the two aggregators (one with 'sort' and one with 'hash'), then inserting into two separate tables for comparison.

'hash' method - QTY = 10425.25 AMOUNT = 136624.4
'sort' method - QTY = 22199.51 AMOUNT = 293109.35

I hope I am explaining the problem well enough. :?

Thanks again for your help.
Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Stick them in a spreadsheet and add them there. What results do you get?
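
If a spreadsheet is awkward, a quick script does the same check. A minimal sketch (plain Python; assumes the 216 test rows are in the inputTestFile.csv read by seqTestInputFile above, with Hours and Dollars in the 12th and 13th positions as in the import schema, and no empty fields):

import csv
from decimal import Decimal

hours = Decimal(0)
dollars = Decimal(0)
with open("inputTestFile.csv") as f:
    for row in csv.reader(f):
        hours += Decimal(row[11])     # Hours: 12th field in the schema
        dollars += Decimal(row[12])   # Dollars: 13th field
print(hours, dollars)

Whichever aggregation method reproduces these raw totals is the one summing correctly.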
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
KEvens
Premium Member
Posts: 10
Joined: Wed Jul 14, 2004 9:10 am

Post by KEvens »

I get the same results as the 'hash' method when I add them up in a spreadsheet and when I add them up by hand. That's how I determined that the 'hash' method was working correctly, but the 'sort' method is not quite correct.

We also made changes to the partitioning and the nodes, etc. Unfortunately, we didn't come up with the correct combination to make it work correctly with the 'sort' method.

Thanks again. If you have any more ideas, please suggest. I am out of ideas to try. :(

Kim
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

Kim,

Out of interest, what are the datatypes of the sum columns that produce different results depending on whether you use the hash or sort method of aggregation? Does any datatype conversion of these attributes occur prior to aggregation? What datatype do they begin as, and what are they at the point of input to the aggregator?

Cheers
Mark Winter
Nothing appeases a troubled mind more than good music