Removing both duplicates

hobocamp · Post by **hobocamp** » Thu Sep 10, 2009 7:02 am

I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Thu Sep 10, 2009 7:22 am

Sort the data with cluster change attribute enabled.

Filter only where change is 0 and use it to lookup and ignore duplicates.

ArndW · Post by **ArndW** » Thu Sep 10, 2009 7:24 am

How about sorting the data on your key criteria, adding a cluster key change column (1 for first duplicate, 0 for subsequent ones), then sort again but add the change column, then a transform stage using a stage variable to store previous record key data. If change column = 1 and previous record is change column 0 and has the same key, drop the record. Drop the record if the change column is 0 as well.

singhald · Post by **singhald** » Thu Sep 10, 2009 9:04 am

Hi Tom,

Welcome on board.

I have implemented this solution before. I will suggest you to use two sort stages and one transformer stage to remove both instance incase of duplication.

1st Sort Stage: you need to sort the data on keys on which you are identifying duplicate records & enable KEY CHANGE column in this sort stage.

2nd Sort Stage: use the same fields on which you sorted the data and keep dont sort previously sorted data and sort only KeyChange Field in ascending order.

Transformer Stage: use two stage variables:
1stStage variable--> DupRec:=>If keyChange =0 Or (keyChange=1 And PreKeyChange = 0) Then "DUPE" Else "UNIQUE"

2ndStageVar--->PreKeyChange =>Map it to keychange field

In your transformer Stage, you can put the variable DUPE in to constraint to get all duplicate records and "UNIQUE" to get all unique records.

Let me know if you face any problem.

betterthanever · Post by **betterthanever** » Thu Sep 10, 2009 10:43 am

hobocamp wrote:I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom

DSXchange

Removing both duplicates

Removing both duplicates

Re: Removing both duplicates