Removing both duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
hobocamp
Premium Member
Premium Member
Posts: 98
Joined: Thu Aug 31, 2006 10:04 am

Removing both duplicates

Post by hobocamp »

I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Sort the data with cluster change attribute enabled.

Filter only where change is 0 and use it to lookup and ignore duplicates.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

How about sorting the data on your key criteria, adding a cluster key change column (1 for first duplicate, 0 for subsequent ones), then sort again but add the change column, then a transform stage using a stage variable to store previous record key data. If change column = 1 and previous record is change column 0 and has the same key, drop the record. Drop the record if the change column is 0 as well.
singhald
Participant
Posts: 180
Joined: Tue Aug 23, 2005 2:50 am
Location: Bangalore
Contact:

Post by singhald »

Hi Tom,

Welcome on board.

I have implemented this solution before. I will suggest you to use two sort stages and one transformer stage to remove both instance incase of duplication.


1st Sort Stage: you need to sort the data on keys on which you are identifying duplicate records & enable KEY CHANGE column in this sort stage.

2nd Sort Stage: use the same fields on which you sorted the data and keep dont sort previously sorted data and sort only KeyChange Field in ascending order.

Transformer Stage: use two stage variables:
1stStage variable--> DupRec:=>If keyChange =0 Or (keyChange=1 And PreKeyChange = 0) Then "DUPE" Else "UNIQUE"

2ndStageVar--->PreKeyChange =>Map it to keychange field

In your transformer Stage, you can put the variable DUPE in to constraint to get all duplicate records and "UNIQUE" to get all unique records.

Let me know if you face any problem.
Regards,
Deepak Singhal
Everything is okay in the end. If it's not okay, then it's not the end.
betterthanever
Participant
Posts: 152
Joined: Tue Jan 13, 2009 8:59 am

Re: Removing both duplicates

Post by betterthanever »

hobocamp wrote:I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom
Post Reply