About tsort operator on sorted data

mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am


Post by mfecdsx »

If the data has already been sorted (by a Sort stage) and DataStage then auto-inserts a tsort operator (at the Join stage, on the same key as the Sort stage), does it actually re-sort the data again?

I've created a test job like this:

Seq_1 ---> Sort_1 ---> Copy_1 ---> Join_Stage --->Copy_3
Seq_2 ---> Sort_2 ---> Copy_2 ---^

Part of the job score:
main_program: This step has 7 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel APT_CombinedOperatorController(1):Sort_2)}
ds2: {op2[2p] (parallel APT_CombinedOperatorController(0):Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds3: {op3[2p] (parallel APT_CombinedOperatorController(1):Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds4: {op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds5: {op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds6: {op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op7[2p] (parallel Copy_41)}
Notice that there are auto-inserted tsort operators at the Join stage even though the data is already sorted.
My question: at run time, will the data be re-sorted again or not?
battaliou
Participant
Posts: 155
Joined: Mon Feb 24, 2003 7:28 am
Location: London

Post by battaliou »

This would appear to be the case. Add the APT_NO_SORT_INSERTION environment variable to the job, set it to "True", and try again.
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

It's my intention to have the auto-inserted tsort operator and auto partitioning, just to test this case.
The question stays the same: is DataStage smart enough to detect that the data is already sorted on the desired field and do nothing, or does it have to re-sort all the data again?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

DataStage cannot tell that data is sorted if it was sorted outside of DataStage. By default it assumes the data is not sorted. If there is a stage in the job that requires sorted data, such as Remove Duplicates or Join, DataStage will automatically add a tsort before that stage.

There are only two ways to prevent these tsorts from being added:
- Adding APT_NO_SORT_INSERTION to the job (see the sketch below), which can be dangerous if the data is not sorted or is sorted incorrectly.
- Adding a Sort stage to the job and setting its sort key mode to "Don't Sort (Previously Sorted)" so the data is not sorted again.

If you do not do one of these two things, you cannot avoid the tsorts.
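
For illustration only, here is one way the variable could be supplied for a single run from the command line, assuming $APT_NO_SORT_INSERTION has already been added to the job as an environment variable parameter (the project and job names below are just placeholders):

# Sketch: run the job once with automatic sort insertion disabled.
# Assumes $APT_NO_SORT_INSERTION was added as an environment variable
# parameter in Job Properties; "myproject" / "TestJoinJob" are examples.
dsjob -run -param '$APT_NO_SORT_INSERTION=True' -jobstatus myproject TestJoinJob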
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Vincent,

but the thing is that the job already contains Sort stages, according to both the job description and the job score (op2 and op3).

@mfecdsx:

What I find suspicious is the difference between the APT_HashPartitioner definitions on ds0 and ds1. In the case of ds1 the subArgument {asc} is specified explicitly in the score, which should of course be the default anyway. Maybe this is why DataStage believes there is a difference in partitioning, which then leads to the sort insertion.

Could you try removing the sort-order property on ds1, or setting it identically on ds0?
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

@BI-RMA:

I think ds0 and ds1 have the same definitions (Sequential File stage -> Sort stage):
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
Do you mean ds2 and ds3 (the Copy stages), which don't have subArgs={ asc } carried over from the previous datasets?

And this is the new job score with APT_DISABLE_COMBINATION = True:
main_program: This step has 9 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op1[2p] (parallel Sort_1)}
ds1: {op1[2p] (parallel Sort_1)
[pp] eSame=>eCollectAny
op4[2p] (parallel Copy_1)}
ds2: {op2[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel Sort_2)}
ds3: {op3[2p] (parallel Sort_2)
[pp] eSame=>eCollectAny
op5[2p] (parallel Copy_2)}
ds4: {op4[2p] (parallel Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds5: {op5[2p] (parallel Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds6: {op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds7: {op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds8: {op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op9[2p] (parallel Copy_41)}
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi mfecdsx,

sorry, I thought I had seen a difference in the sort order of the hash partitioner in the job score you posted, but on double-checking I can't see it now. :?

Concerning the environment variable: Vincent advised setting APT_NO_SORT_INSERTION, not APT_DISABLE_COMBINATION. If you do not use the Copy stages to drop any input columns, DataStage will usually remove those stages entirely at compile time; in any case it can combine the copy operator with the following join step. By setting APT_DISABLE_COMBINATION you do not allow it to do that, but DataStage still inserts the tsort operators.

There is another option: set APT_SORT_INSERTION_CHECK_ONLY. With this the inserted tsort only verifies the sort order instead of actually sorting, but the job will abort if the data arrives at the Join stage in the wrong order.
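
As a rough sketch only (the dsenv path and the exact value are installation-dependent; treat them as examples), the two behaviours could be compared by setting one variable or the other, either in the engine environment or as job-level environment variable parameters:

# Example only - e.g. in $DSHOME/dsenv, or as job parameters; pick one of the two:
export APT_NO_SORT_INSERTION=True          # suppress the inserted tsorts entirely
export APT_SORT_INSERTION_CHECK_ONLY=True  # keep the inserted tsorts, but have them
                                           # only verify the sort order (the job
                                           # aborts if the data is out of order)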
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

I understand that setting APT_NO_SORT_INSERTION to True would solve this problem, but that is not actually my concern.

The point is that I want to know whether, if I leave the job design at its defaults (auto partitioning, auto-inserted sorts), the job performance will be as good as manually partitioning and sorting at every stage.