
About tsort operator on sorted data

Posted: Wed Jan 22, 2014 1:39 am
by mfecdsx
If the data has already been sorted (Sort stage) and DataStage then auto-inserts a tsort operator (Join stage, same key as the Sort stage), does it actually re-sort the data?

I've created a test job like this:

Seq_1 ---> Sort_1 ---> Copy_1 ---> Join_Stage --->Copy_3
Seq_2 ---> Sort_2 ---> Copy_2 ---^

Part of the job score:
main_program: This step has 7 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel APT_CombinedOperatorController(1):Sort_2)}
ds2: {op2[2p] (parallel APT_CombinedOperatorController(0):Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds3: {op3[2p] (parallel APT_CombinedOperatorController(1):Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds4: {op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds5: {op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds6: {op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op7[2p] (parallel Copy_41)}
Notice that there are auto-inserted tsort operators at the Join stage although the data is already sorted.
My question: at run time, will the data be re-sorted or not?
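
A quick way to spot these operators in a longer score dump is to scan for the "inserted tsort operator" text. Below is a minimal Python sketch, assuming the score has been saved as plain text in the same layout as above. The function name and the regular expression are illustrative only and are not part of DataStage.

```python
import re

# Matches lines such as:
#   op4[2p] (parallel inserted tsort operator {key={value=KeyCol, ...}}}(0) in Join_Stage)
# The pattern is an assumption based on the score layout shown in this thread.
TSORT_RE = re.compile(
    r"(op\d+)\[\d+p\] \(parallel inserted tsort operator "
    r"\{key=\{value=(\w+).*?\}\}\}\(\d+\) in (\w+)\)"
)

def find_inserted_tsorts(score_text):
    """Return {operator id: (sort key, stage name)} for each inserted tsort."""
    found = {}
    for op_id, key, stage in TSORT_RE.findall(score_text):
        # Each operator appears twice in a score (as consumer and as producer),
        # so keying by operator id removes the duplicates.
        found[op_id] = (key, stage)
    return found

sample = """\
op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds4: {op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
"""

print(find_inserted_tsorts(sample))
```

Running this against both versions of the score makes it easy to confirm whether a change to the job actually removed the inserted sorts.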

Posted: Wed Jan 22, 2014 6:07 am
by battaliou
This would appear to be the case. Add the APT_NO_SORT_INSERTION environment variable to the job, set it to "TRUE", and try again.

Posted: Wed Jan 22, 2014 10:17 pm
by mfecdsx
I intentionally left the auto-inserted tsort operators and auto partitioning in place, just to test this case.
The question stays the same: is DataStage smart enough to detect that the data is already sorted on the required field and do nothing, or does it have to re-sort all the data?

Posted: Wed Jan 22, 2014 11:34 pm
by vmcburney
DataStage cannot tell that data is sorted if it is sorted outside of DataStage. By default it assumes data is not sorted. If there is a stage in the job that requires sorted data such as Remove Duplicates or Join then DataStage will automatically add a tsort before that stage.

There are only two ways to prevent these tsorts from being added:
- Adding APT_NO_SORT_INSERTION to the job, which can be dangerous if the data is not sorted or is incorrectly sorted.
- Adding a sort stage to the job and setting an option that the data is already sorted and should not be sorted again.

If you do not do either of these things then you cannot avoid the tsorts.
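
To see why suppressing the sort is risky when the input is not really sorted, here is a minimal Python sketch of a generic sort-merge inner join. It is not DataStage's Join implementation, only an illustration of the general technique, and the function name and sample rows are hypothetical. The join trusts its inputs to be ordered by key, so unsorted input silently loses matches rather than raising an error.

```python
def merge_join(left, right):
    """Inner join of two lists of (key, value) pairs ASSUMED sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Emit one output row per right-hand row with this key.
            k = j
            while k < len(right) and right[k][0] == left_key:
                out.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

right = [(1, "x"), (2, "y"), (3, "z")]
sorted_left = [(1, "a"), (2, "b"), (3, "c")]
unsorted_left = [(3, "c"), (1, "a"), (2, "b")]

print(len(merge_join(sorted_left, right)))    # all keys match
print(len(merge_join(unsorted_left, right)))  # matches are silently dropped
```

With the unsorted left input the join returns fewer rows and reports nothing, which is the kind of quiet data loss the first option exposes you to.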

Posted: Thu Jan 23, 2014 6:54 am
by BI-RMA
Hi Vincent,

the thing is that the job already contains Sort stages, according to both the job description and the job score (op2 and op3).

@mfecdsx:

What I find suspicious is the difference between the APT_HashPartitioner definitions on ds0 and ds1. For ds1 the subArgs value {asc} is specified explicitly in the score, although it should be the default anyway. Maybe this is why DataStage believes there is a difference in partitioning, which then leads to the sort insertion.

Could you try removing the sort-order property on ds1, or setting it identically on ds0?

Posted: Thu Jan 23, 2014 10:02 pm
by mfecdsx
@BI-RMA

I think ds0 and ds1 have the same definition (Sequential File stage -> Sort stage):
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
Do you mean ds2 and ds3 (Copy stage), which don't carry over subArgs={ asc } from the previous datasets?

And this is the new job score with APT_DISABLE_COMBINATION = True:
main_program: This step has 9 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op1[2p] (parallel Sort_1)}
ds1: {op1[2p] (parallel Sort_1)
[pp] eSame=>eCollectAny
op4[2p] (parallel Copy_1)}
ds2: {op2[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel Sort_2)}
ds3: {op3[2p] (parallel Sort_2)
[pp] eSame=>eCollectAny
op5[2p] (parallel Copy_2)}
ds4: {op4[2p] (parallel Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds5: {op5[2p] (parallel Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds6: {op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds7: {op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds8: {op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op9[2p] (parallel Copy_41)}

Posted: Fri Jan 24, 2014 3:52 am
by BI-RMA
Hi mfecdsx,

sorry, I thought I had seen a difference in the hash partitioner's sort order in the job score you posted, but on double-checking I can't see it now.

Concerning the environment variable: Vincent advised setting APT_NO_SORT_INSERTION, not APT_DISABLE_COMBINATION. If you do not use the Copy stage to drop any input columns, DataStage would usually ignore those stages entirely at compile time. In any case it can combine the copy operator with the following join step. Setting APT_DISABLE_COMBINATION stops the system from doing that, but DataStage still inserts the tsort operator.

There is another option: set APT_SORT_INSERTION_CHECK_ONLY. With this set, the tsort operators that DataStage inserts only verify the sort order instead of sorting, and the job aborts if data arrives at the Join stage in the wrong order.
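
The check-instead-of-sort idea can be pictured as a single pass that compares each key with the previous one and fails on the first violation. The Python sketch below is only an illustration of that idea under that assumption, not DataStage code, and `check_sorted` is a hypothetical name.

```python
def check_sorted(records, key):
    """Yield records unchanged, raising if a key is smaller than the previous one."""
    previous = None
    seen_any = False
    for record in records:
        current = key(record)
        if seen_any and current < previous:
            # Comparable to the job aborting when the order is wrong.
            raise ValueError(f"out of order: {current!r} after {previous!r}")
        previous = current
        seen_any = True
        yield record

rows = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
checked = list(check_sorted(rows, key=lambda row: row[0]))
print(checked == rows)
```

A pass like this needs one comparison per record and no buffering of the whole partition, whereas a full sort has to hold or spill the data before it can emit anything.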

Posted: Fri Jan 24, 2014 4:16 am
by mfecdsx
I understand that setting APT_NO_SORT_INSERTION to True would solve this problem, but that is not actually my concern.

The point is that I want to know whether, if I leave the job design at its defaults (auto partitioning, auto sort insertion), the job will perform as well as it would with manual partitioning and sorting at every stage.