Latest record to survive

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Latest record to survive

Post by hitmanthesilentassasin »

Hi,

I have a requirement to survive the most frequent non blank value, incase of tie the latest record based on a date value column should survive.

not able to find a way to pick the latest record value when its a tie. any suggestions ?

Thanks
prasannakumarkk
Participant
Posts: 117
Joined: Wed Feb 06, 2013 9:24 am
Location: Chennai,TN, India

Post by prasannakumarkk »

In a sort stage, make partitioning and inline sort based on key columns and have the date column partitioned as well as sorted and in the stage page give all the keys except date column. Set key change column to true. Filter the record for which key change column is 1.
Thanks,
Prasanna
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Thanks Mate, but my requirement is to survive the best record based on most frequent non blank values in case if it is a tie then the tie breaker should be the date. I was looking for the survive stage solutions. or should I restore to the age old transformer method using stage variables?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you can carry through a column such as LAST_UPDATE_TIMESTAMP you can use the highest value of that as your tie-breaker.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

I have already tried to define expressions like c.timestamp > c.timestamp and also like b.timestamp > b.timestamp but nothing fetches the desired results. the only options available is greater than but not like greatest in version 8.7

An alternate to this could be to sort the data based on the key column and the timestamp column and generate a string based on the previous values and the current values and in survivor stage select the data based on the longest string value.

I was hoping that there could be more clear way to define the greatest value.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Curious choices. Did you try c.timestamp > b.timestamp?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Yes I did. Its strange the functionality is limited to greater than and longest length of the string and not like highest, lowest or median values
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not really. Keep in mind what a survive rule is doing; it's specifying a condition under which a field in the currect (c) record will replace that field in the so-far-the-best (b) record. Median couldn't easily be supported, particularly when many steps are involved in the survive rule, since the median isn't available till all records in the block have been processed, but the comparisons are done one record at a time (the b record is initialized from the master).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

I think I have got the solution. Run the job in sequential mode and sort the data based on the key and the timestamp column so that that latest record is processed as the last record. whenever there is a tie survive stage seems to pick the value from the last record(can't bet my life on this ) but this is what a quick test has shown.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You don't need sequential mode if you partition on the key and sort on the key and timestamp column. The partitioning will keep groups of records on the same node.
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Agree - the job can retain parallel mode with the appropriate partitioning. I was mentioning the testing steps that yielded the results :D
Post Reply