About tsort operator on sorted data

mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am


Post by mfecdsx »

If the data has already been sorted (by a Sort stage) and DataStage then auto-inserts a tsort operator (at the Join stage, on the same key as the Sort stage), does it actually re-sort the data again?

I've created a test job like this:

Seq_1 ---> Sort_1 ---> Copy_1 ---> Join_Stage --->Copy_3
Seq_2 ---> Sort_2 ---> Copy_2 ---^

Part of the job score:
main_program: This step has 7 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel APT_CombinedOperatorController(1):Sort_2)}
ds2: {op2[2p] (parallel APT_CombinedOperatorController(0):Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds3: {op3[2p] (parallel APT_CombinedOperatorController(1):Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds4: {op4[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds5: {op5[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds6: {op6[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op7[2p] (parallel Copy_41)}
Notice that there are auto-inserted tsort operators at the Join stage even though the data is already sorted.
My question: at run time, will the data be re-sorted again or not?
battaliou
Participant
Posts: 155
Joined: Mon Feb 24, 2003 7:28 am
Location: London

Post by battaliou »

This would appear to be the case. Add the APT_NO_SORT_INSERTION environment variable to the job, set it to "True", and try again.
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

It's my intention to have the auto-inserted tsort operator and auto partitioning, just to test this case.
The question stays the same: is DataStage smart enough to detect that the data is already sorted on the desired field and do nothing, or does it have to re-sort all the data again?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

DataStage cannot tell that data is sorted if it was sorted outside of DataStage. By default it assumes the data is not sorted. If there is a stage in the job that requires sorted data, such as Remove Duplicates or Join, DataStage will automatically add a tsort before that stage.

There are only two ways to prevent these tsorts from being added:
- Adding APT_NO_SORT_INSERTION to the job (see the sketch below), which can be dangerous if the data is not sorted or is sorted incorrectly.
- Adding a Sort stage to the job and setting its sort key mode to "Don't Sort (Previously Sorted)" so the data is not sorted again.

If you do not do one of these two things, you cannot avoid the tsorts.
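
For illustration only, here is one way the variable could be supplied for a single run from the command line, assuming $APT_NO_SORT_INSERTION has already been added to the job as an environment variable parameter (the project and job names below are just placeholders):

# Sketch: run the job once with automatic sort insertion disabled.
# Assumes $APT_NO_SORT_INSERTION was added as an environment variable
# parameter in Job Properties; "myproject" / "TestJoinJob" are examples.
dsjob -run -param '$APT_NO_SORT_INSERTION=True' -jobstatus myproject TestJoinJob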
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Vincent,

but the thing is that the job already contains Sort stages, according to both the job description and the job score (op2 and op3).

@mfecdsx:

What I find suspicious is the difference between the APT_HashPartitioner definitions on ds0 and ds1. In the case of ds1 the subArgument {asc} is specified explicitly in the score, which should of course be the default anyway. Maybe this is why DataStage believes there is a difference in partitioning, which then leads to the sort insertion.

Could you try removing the sort-order property on ds1, or setting it identically on ds0?
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

@BI-RMA:

I think ds0 and ds1 have the same definitions (Sequential File stage -> Sort stage):
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op2[2p] (parallel APT_CombinedOperatorController(0):Sort_1)}
ds1: {op1[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
Do you mean ds2 and ds3 (the Copy stages), which don't have subArgs={ asc } carried over from the previous datasets?

And this is the new job score with APT_DISABLE_COMBINATION = True:
main_program: This step has 9 datasets:
ds0: {op0[1p] (sequential Sequential_File_1)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op1[2p] (parallel Sort_1)}
ds1: {op1[2p] (parallel Sort_1)
[pp] eSame=>eCollectAny
op4[2p] (parallel Copy_1)}
ds2: {op2[1p] (sequential Sequential_File_2)
eOther(APT_HashPartitioner { key={ value=KeyCol,
subArgs={ asc }
}
})<>eCollectAny
op3[2p] (parallel Sort_2)}
ds3: {op3[2p] (parallel Sort_2)
[pp] eSame=>eCollectAny
op5[2p] (parallel Copy_2)}
ds4: {op4[2p] (parallel Copy_1)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)}
ds5: {op5[2p] (parallel Copy_2)
eOther(APT_HashPartitioner { key={ value=KeyCol }
})#>eCollectAny
op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)}
ds6: {op6[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(0) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds7: {op7[2p] (parallel inserted tsort operator {key={value=KeyCol, subArgs={asc, cs}}}(1) in Join_Stage)
[pp] eSame=>eCollectAny
op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)}
ds8: {op8[2p] (parallel APT_JoinSubOperatorNC in Join_Stage)
eAny=>eCollectAny
op9[2p] (parallel Copy_41)}
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi mfecdsx,

sorry, I thought I had seen a difference in the sort order of the hash partitioner in the job score you posted, but on double-checking I can't see it now. :?

Concerning the environment variable: Vincent advised setting APT_NO_SORT_INSERTION, not APT_DISABLE_COMBINATION. If you do not use the Copy stages to drop any input columns, DataStage will usually remove those stages entirely at compile time; in any case it can combine the copy operator with the following join step. By setting APT_DISABLE_COMBINATION you do not allow it to do that, but DataStage still inserts the tsort operators.

There is another option: set APT_SORT_INSERTION_CHECK_ONLY. With this the inserted tsort only verifies the sort order instead of actually sorting, but the job will abort if the data arrives at the Join stage in the wrong order.
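
As a rough sketch only (the dsenv path and the exact value are installation-dependent; treat them as examples), the two behaviours could be compared by setting one variable or the other, either in the engine environment or as job-level environment variable parameters:

# Example only - e.g. in $DSHOME/dsenv, or as job parameters; pick one of the two:
export APT_NO_SORT_INSERTION=True          # suppress the inserted tsorts entirely
export APT_SORT_INSERTION_CHECK_ONLY=True  # keep the inserted tsorts, but have them
                                           # only verify the sort order (the job
                                           # aborts if the data is out of order)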
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
mfecdsx
Premium Member
Posts: 11
Joined: Thu Aug 02, 2007 2:35 am

Post by mfecdsx »

I understand that setting APT_NO_SORT_INSERTION to True would solve this problem, but that is not actually my concern.

The point is that I want to know whether, if I leave the job design at its defaults (auto partitioning, auto-inserted sorts), the job performance will be as good as manually partitioning and sorting at every stage.