Duplicate records to Error

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Gius
Participant
Posts: 29
Joined: Mon Mar 09, 2015 2:10 am

Duplicate records to Error

Post by Gius »

Hello,
I'm reading a file where the fields Person Identifier and Item Identifier are the keys.
I would like to send all the duplicate records to Error (for example, an output file) and keep just one (the one with the max load date), but I don't really know how to do that. Could you help me please?
Thank You -
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:?: Okay, what have you tried and what issues are you facing?

There is the Remove Duplicates stage, which can retain one record per "key" checked for duplicates, as long as the records are sorted properly. I don't know whether it supports any kind of reject link; I didn't see one during a quick check of the docs. Failing that... perhaps take the deduplicated result and join it back to the original source to get "all the duplicates" for your error file?

Let's see what other folks think who may actually have had to do something like this.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

For a similar requirement, I used the Sort stage. Sort your data on Person Identifier, Item Identifier, and load date (descending), and enable the Create Key Change Column option; it assigns 1 to the first record in each key group and 0 to the rest. Then use a Transformer stage with two output links: one for records with key change = 1 and one for records with key change = 0. The second link will be your error output.
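Outside DataStage, the Sort + Create Key Change Column logic above can be sketched in plain Python for illustration. The field names (person_id, item_id, load_date) and integer dates are assumptions for the example, not from the original post:

```python
# Illustrative sketch of the Sort stage + Create Key Change Column approach.
# Field names (person_id, item_id, load_date) are assumed for the example.

def split_duplicates(records):
    """Sort by key ascending and load_date descending, flag the first
    record of each key group (key change = 1) and route the rest (0)
    to the error output."""
    records = sorted(
        records,
        key=lambda r: (r["person_id"], r["item_id"], -r["load_date"]),
    )
    keep, errors = [], []
    prev_key = None
    for r in records:
        key = (r["person_id"], r["item_id"])
        if key != prev_key:   # key change = 1: latest record in the group
            keep.append(r)
        else:                 # key change = 0: duplicate, goes to error
            errors.append(r)
        prev_key = key
    return keep, errors

rows = [
    {"person_id": 1, "item_id": "A", "load_date": 20240101},
    {"person_id": 1, "item_id": "A", "load_date": 20240301},
    {"person_id": 2, "item_id": "B", "load_date": 20240102},
]
keep, errors = split_duplicates(rows)
# keep   -> the 20240301 row for (1, "A") plus the (2, "B") row
# errors -> the 20240101 row for (1, "A")
```

Sorting load date descending is what makes "keep the max load date" fall out of the key-change flag: the first row of each group is automatically the latest one.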
Rumu
IT Consultant
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's another approach, but it may not satisfy the "all duplicates" requirement... which I assume also includes the one to keep. So, if there is only one occurrence of a key, there's no error. However, if there are 12 occurrences, then I'm thinking they want one specific one (max load date) to be kept, but all 12 should go to the error file rather than 11. It really depends on what "all the duplicate records" means in the OP. TBH, I may be over-thinking things. :wink:

Gius, can you clarify please?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Gius
Participant
Posts: 29
Joined: Mon Mar 09, 2015 2:10 am

Post by Gius »

Yes, thank you, the Sort with Create Key Change Column worked fine.
Another solution was to use a Loop Condition within the Transformer.

The requirement was: if there are 12 duplicate records, send 11 to Error and 1 (the max load date) to output.
If I had to send all 12 to Error, I think I would do a join afterwards between the unique keys of the 11 Error records and all the records.
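For the "send all 12 to Error" variant, the join-back idea amounts to collecting the keys that occur more than once and pulling every matching record from the source. A minimal Python sketch, with the same assumed field names as before:

```python
from collections import Counter

# Illustrative sketch of the "join back to get all occurrences" variant.
# Field names (person_id, item_id, load_date) are assumed for the example.

def all_duplicates_to_error(records):
    """Route every record whose (person_id, item_id) key occurs more
    than once to the error output; unique-key records pass through."""
    counts = Counter((r["person_id"], r["item_id"]) for r in records)
    output = [r for r in records
              if counts[(r["person_id"], r["item_id"])] == 1]
    errors = [r for r in records
              if counts[(r["person_id"], r["item_id"])] > 1]
    return output, errors

rows = [
    {"person_id": 1, "item_id": "A", "load_date": 20240101},
    {"person_id": 1, "item_id": "A", "load_date": 20240301},
    {"person_id": 2, "item_id": "B", "load_date": 20240102},
]
output, errors = all_duplicates_to_error(rows)
# errors -> both (1, "A") rows; output -> the single (2, "B") row
```

Here all 12 occurrences of a duplicated key land in the error stream, matching craig's reading of the requirement, rather than 11 as in the key-change split.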
Post Reply