Sort Stage - (Don't Sort) Performance Impact

Posted: Mon Aug 26, 2019 9:55 am
by sensiva
Hello,

I would like your views on using multiple Sort stages purely to create key change columns, without actually sorting the data.

Here is the scenario:

Code:

Sort Stage 1 - Sort Mode - Sort for all columns
Sorting columns A, B, C, D, E

Sort Stage 2 - Sort Mode - Don't Sort (Previously Sorted)
Sorting columns A, B, C 
Create a key change column 1

Sort Stage 3 - Sort Mode - Don't Sort (Previously Sorted)
Sorting columns A
Create a key change column 2
I read in the Knowledge Center that the Don't Sort mode does not use much memory, but I am still hesitant to use multiple Sort stages on input that will probably contain around 3 million records. Is it advisable to use the Sort stage just to create key change columns? Otherwise I would have to do it in a Transformer by comparing each row with the previous record.

Any pointers would be of great help.

Thanks
Sen

Posted: Mon Aug 26, 2019 12:17 pm
by chulett
Sorry, it's been a while, but can't you sort and create the key change column at the same time? That would mean two rather than three Sort stages. And if the sort handles the key change, I'm not sure there's a need to have a Transformer do it post-sort, unless the rules are such that you would need stage variables to handle them properly, seeing as the data needs to be sorted regardless.

Either way, I don't think you need to be too concerned about the performance impact of "Don't Sort" stages, but I'm curious what others think. And 3M isn't really a large amount to sort IMHO unless your infrastructure is not up to the task.

Posted: Tue Aug 27, 2019 3:43 am
by sensiva
Thanks for your reply
chulett wrote:Sorry, it's been awhile but can't you sort and create the key change column at the same time? Meaning two rather than three Sort stages.
Yes, the Sort stage does create a keyChange column while sorting, but I don't want a keyChange on all 5 keys (A, B, C, D, E). I want one keyChange on keys (A, B, C) and another on just A.

Say, for example, A = COUNTRY, B = STATE, C = ORDER, D = PRODUCTS, E = xxxx.

I need to sort on all of these keys to process the data, and then I need one key change down to ORDER (that is, on COUNTRY, STATE, ORDER) and another key change on COUNTRY alone, to route and process the rows differently.
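
To make the requirement concrete, here is a minimal Python sketch (not DataStage code) of what the two key change columns would look like on data that is already sorted by all keys. It assumes the usual Sort stage behaviour where the key change column is 1 on the first row of each group and 0 otherwise. The function name `key_change_flags` and the sample rows are made up for illustration only.

```python
def key_change_flags(sorted_rows, n_keys):
    """Return 1 for the first row of each group on the leading n_keys
    columns and 0 for every other row. Input must already be sorted."""
    flags = []
    prev_key = None  # no previous row yet, so the first row always flags 1
    for row in sorted_rows:
        key = tuple(row[:n_keys])
        flags.append(1 if key != prev_key else 0)
        prev_key = key
    return flags


# Hypothetical sample: (COUNTRY, STATE, ORDER, PRODUCT), sorted on all columns.
ORDERS = sorted([
    ("DE", "BY", 1, "p1"),
    ("DE", "BY", 1, "p2"),
    ("DE", "BY", 2, "p1"),
    ("FR", "IDF", 1, "p1"),
    ("FR", "IDF", 1, "p3"),
])

key_change_order = key_change_flags(ORDERS, 3)    # break on COUNTRY, STATE, ORDER
key_change_country = key_change_flags(ORDERS, 1)  # break on COUNTRY only

for row, kc_order, kc_country in zip(ORDERS, key_change_order, key_change_country):
    print(row, kc_order, kc_country)
```

Both columns come from a single pass over the same sorted data, which is why no additional real sort is needed to produce them.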
chulett wrote:I don't think you need to be too concerned about the performance impact of "Don't Sort" stages but curious what others think. And 3M isn't really a large amount to sort IMHO unless your infrastructure is not up to the task.
Our infrastructure is well built, and I could still ask for more CPU if need be. But I would like my design to be sound first, so that I can justify such a request.

I will go ahead and implement it with 3 Sort stages, 2 of them needed only for the keyChange columns.

And definitely, as said, it would be great to have others' views as well.

Thanks
Sen

Posted: Mon Sep 02, 2019 10:50 pm
by Mike
I think I would go with 1 sort stage.

Partition by A
Sort by A,B,C,D,E
LastRowInGroup(C) transformer function will give you key breaks on A,B,C
LastRowInGroup(A) transformer function will give you key break on A

Partition by A since you likely want all of the A rows passing through the same processing node.
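
As a rough illustration of the single-sort approach, the sketch below mimics it in plain Python rather than DataStage. It assumes LastRowInGroup(col) is true on the last row before a change in that column or in any higher-order sort key, as described above. The names `partition_by_first_key` and `last_row_flags` and the sample data are hypothetical, and the CRC-based hash only stands in for a hash partitioner.

```python
import zlib


def partition_by_first_key(rows, n_partitions):
    """Hash-partition rows on column A so that every row with the same
    A value lands in the same partition (processing node)."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[0]).encode("utf-8")) % n_partitions
        parts[idx].append(row)
    return parts


def last_row_flags(sorted_rows, n_keys):
    """True on the last row of each group on the leading n_keys columns.
    Input must already be sorted on at least those columns."""
    flags = []
    for i, row in enumerate(sorted_rows):
        is_last = (
            i + 1 == len(sorted_rows)
            or sorted_rows[i + 1][:n_keys] != row[:n_keys]
        )
        flags.append(is_last)
    return flags


# Hypothetical unsorted input: (A, B, C, D) = (COUNTRY, STATE, ORDER, PRODUCT).
SAMPLE = [
    ("FR", "IDF", 1, "p3"),
    ("DE", "BY", 2, "p1"),
    ("DE", "BY", 1, "p2"),
    ("FR", "IDF", 1, "p1"),
    ("DE", "BY", 1, "p1"),
]

for node, part in enumerate(partition_by_first_key(SAMPLE, 2)):
    part.sort()                          # the one real sort, on all keys
    last_abc = last_row_flags(part, 3)   # stands in for LastRowInGroup(C)
    last_a = last_row_flags(part, 1)     # stands in for LastRowInGroup(A)
    for row, f_abc, f_a in zip(part, last_abc, last_a):
        print(node, row, f_abc, f_a)
```

Note that these flags mark the last row of each group, whereas the Sort stage's key change column marks the first row, so downstream logic may need adjusting when switching between the two.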

That's all from memory as I haven't used the LastRowInGroup() function for some time now.

Mike

Posted: Mon Sep 23, 2019 5:43 am
by sensiva
Thanks Mike, your solution worked great, and just one Sort stage was enough.