Page 1 of 1

Latest record to survive

Posted: Fri May 31, 2013 1:58 am
by hitmanthesilentassasin
Hi,

I have a requirement to survive the most frequent non blank value, incase of tie the latest record based on a date value column should survive.

not able to find a way to pick the latest record value when its a tie. any suggestions ?

Thanks

Posted: Fri May 31, 2013 3:03 am
by prasannakumarkk
In a sort stage, make partitioning and inline sort based on key columns and have the date column partitioned as well as sorted and in the stage page give all the keys except date column. Set key change column to true. Filter the record for which key change column is 1.

Posted: Fri May 31, 2013 3:19 am
by hitmanthesilentassasin
Thanks Mate, but my requirement is to survive the best record based on most frequent non blank values in case if it is a tie then the tie breaker should be the date. I was looking for the survive stage solutions. or should I restore to the age old transformer method using stage variables?

Posted: Fri May 31, 2013 4:15 pm
by ray.wurlod
If you can carry through a column such as LAST_UPDATE_TIMESTAMP you can use the highest value of that as your tie-breaker.

Posted: Sun Jun 02, 2013 5:38 pm
by hitmanthesilentassasin
I have already tried to define expressions like c.timestamp > c.timestamp and also like b.timestamp > b.timestamp but nothing fetches the desired results. the only options available is greater than but not like greatest in version 8.7

An alternate to this could be to sort the data based on the key column and the timestamp column and generate a string based on the previous values and the current values and in survivor stage select the data based on the longest string value.

I was hoping that there could be more clear way to define the greatest value.

Posted: Sun Jun 02, 2013 8:19 pm
by ray.wurlod
Curious choices. Did you try c.timestamp > b.timestamp?

Posted: Sun Jun 02, 2013 8:35 pm
by hitmanthesilentassasin
Yes I did. Its strange the functionality is limited to greater than and longest length of the string and not like highest, lowest or median values

Posted: Sun Jun 02, 2013 9:01 pm
by ray.wurlod
Not really. Keep in mind what a survive rule is doing; it's specifying a condition under which a field in the currect (c) record will replace that field in the so-far-the-best (b) record. Median couldn't easily be supported, particularly when many steps are involved in the survive rule, since the median isn't available till all records in the block have been processed, but the comparisons are done one record at a time (the b record is initialized from the master).

Posted: Sun Jun 02, 2013 9:08 pm
by hitmanthesilentassasin
I think I have got the solution. Run the job in sequential mode and sort the data based on the key and the timestamp column so that that latest record is processed as the last record. whenever there is a tie survive stage seems to pick the value from the last record(can't bet my life on this ) but this is what a quick test has shown.

Posted: Sun Jun 02, 2013 10:14 pm
by vmcburney
You don't need sequential mode if you partition on the key and sort on the key and timestamp column. The partitioning will keep groups of records on the same node.

Posted: Sun Jun 02, 2013 11:38 pm
by hitmanthesilentassasin
Agree - the job can retain parallel mode with the appropriate partitioning. I was mentioning the testing steps that yielded the results :D