How to find duplicates in a sequential file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Rajesekhar
Participant
Posts: 19
Joined: Fri May 14, 2010 3:54 pm

How to find duplicates in a sequential file

Post by Rajesekhar »

Hi I have a CSV file. I need to find if there are any duplicates in the files. If any duplicate records are present I need to reject them and put into a sequential file.

Can any body let me know hoe can i do this in DataStage8.1 version?

Thanks, Raj.
Rajesekhar Potteti
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

You will have to do the following:

1) Sort your input on the key fields.

2) Use a transformer with stage variables to compare each record with the value of the previous record (on whatever key fields you need to check).

3) Setup constraints on the transformer's output links to restrict record flow so that only records where all key fields match the previous record are output to the reject (and only the reject) link.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
Rajesekhar
Participant
Posts: 19
Joined: Fri May 14, 2010 3:54 pm

Post by Rajesekhar »

Thanks for answering.

Sorry for being stupid.

Could you tell me how can we compare each record with the value of the previous record in Tx stage?
Rajesekhar Potteti
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

You would need to use StageVariables to do what Andy suggested.

Another way to do is to use a Sort stage and set Create Key Change Column Property to True and Allow duplicates property to True. Then after the Sort stage, put a filter stage and filter on the key change column to grab the duplicates and non duplicates ( key change column =0 and key change column =1) into the links you want.

Hope that helps.
Kris

Where's the "Any" key?-Homer Simpson
siauchun84
Participant
Posts: 63
Joined: Mon Oct 20, 2008 12:01 am
Location: Malaysia

Post by siauchun84 »

That will be easier by turning on the "Create Key Change Column" in the Sort stage by set it to True. Those duplicate value will get 1 in the keyChange column. Use the transformer to filter out then.
Rajesekhar
Participant
Posts: 19
Joined: Fri May 14, 2010 3:54 pm

Post by Rajesekhar »

Hi All,

Thanks for the suggestions you gave to resolve this.

I followed three approches to make sure all are giving same results. And All gave same results :)

1. Using Sort and Filter stage.
In sort stage I had set Create Key Change column to True.
In Filter stage using KeyChange attribute i routed unique and duplicate values to two diffrent seq files.

2. Using sort Transformer Stage
In sort stage I had set Create Key Change column to True.
In Transformer stage I gave constraints if KeyChange is 0 then to one out Link and if KeyChange is 1 then to another out link.

3. Using sort Transformer Stage
In Transformer stage using StageVaribales I comapred current and previous records accordingly I routed messages to two out links.

All three approaches worked very well.

Thanks all for your help!!
Rajesekhar P
Rajesekhar Potteti
Rajesekhar
Participant
Posts: 19
Joined: Fri May 14, 2010 3:54 pm

Post by Rajesekhar »

Hi All,

Thanks for the suggestions you gave to resolve this.

I followed three approches to make sure all are giving same results. And All gave same results :)

1. Using Sort and Filter stage.
In sort stage I had set Create Key Change column to True.
In Filter stage using KeyChange attribute i routed unique and duplicate values to two diffrent seq files.

2. Using sort Transformer Stage
In sort stage I had set Create Key Change column to True.
In Transformer stage I gave constraints if KeyChange is 0 then to one out Link and if KeyChange is 1 then to another out link.

3. Using sort Transformer Stage
In Transformer stage using StageVaribales I comapred current and previous records accordingly I routed messages to two out links.

All three approaches worked very well.

Thanks all for your help!!
Rajesekhar P
Rajesekhar Potteti
Post Reply