how to remove the duplicate records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Prithivi
Participant
Posts: 7
Joined: Sun Oct 10, 2004 12:51 am
Contact:

how to remove the duplicate records

Post by Prithivi »

Hi,

How can I remove duplicate records using a Sequential File stage component?
My flow is like this:

Sequential stage--->transformer--->sequential stage/Oracle OCI


Suppose I am using a flat file as a source and it has some duplicate records. I need to remove those duplicate records in the Transformer stage and insert the clean records into the target file or target table.

I need your help. Please give me some idea of how to solve this problem. It is very urgent for me.


Regards
Prithivi
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data; optionally, the sort program can remove the duplicate records for you.

If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one and not pass duplicates on to the subsequent stage.
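As a minimal command-line sketch of the UNIX-level approach (file names and sample data are illustrative, not from the original post):

```shell
# Create a small sample file with duplicate records (illustrative data).
printf 'b\na\nb\nc\na\n' > /tmp/dedupe_input.txt

# sort orders the records; -u keeps only one line from each run of
# identical lines, so duplicates are removed in a single pass.
sort -u /tmp/dedupe_input.txt > /tmp/dedupe_out.txt

cat /tmp/dedupe_out.txt
```

The same command can be run before the job (as a pre-sort step) or, as discussed later in this thread, as a filter command on the Sequential File stage itself.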
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can do a sort with a stage-variable check, or use an Aggregator stage if the data volume is low.

There are many other ways and everything depends on detailed analysis of what you are doing and what you wish to achieve.
Prithivi
Participant
Posts: 7
Joined: Sun Oct 10, 2004 12:51 am
Contact:

Post by Prithivi »

ArndW wrote: Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data; optionally, the sort program can remove the duplicate records for you.

If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one and not pass duplicates on to the subsequent stage.

Prithivi -- Can you tell me briefly? I have used the Sort stage and am getting the data in sorted order. After that, how can I check for the duplicate records through the stage variable?

I need more information about it.

Prithivi
amsh76
Charter Member
Posts: 118
Joined: Wed Mar 10, 2004 10:58 pm

Post by amsh76 »

If your volume is not that high, you can always write the records to a hashed file (HF), but make sure you sort them before writing.

The hashed file will remove the duplicates for you, since rows with the same key overwrite each other.
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Post by kris »

You already have the needful information above. Have you tried to do what you are trying to do, or do you want someone to do it for you?

Here is one solution:

Use a filter command (a sort command) in the Sequential File stage.

IN SEQFileStage-------->Xfm-------->OUT SEQFileStage

Open IN SEQFileStage, click on the Stage tab, and check 'Stage uses filter command'. Now click on the Output tab and write your sort command in the Filter command box.

Your sort command should be:
sort -u <positions of sort keys>

You don't have to redirect it to a new file; it will read from stdin.

This filter command will dedupe your input file, and you will write the resultant records to another file.
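The stage's filter simply pipes the file's contents through the command, so you can verify your filter command outside DataStage at a shell prompt (sample data is illustrative):

```shell
# The Sequential File stage feeds each record to the filter on stdin;
# the equivalent at the command line is a plain pipe.  sort -u both
# orders the records and drops exact duplicates.
printf 'row2\nrow1\nrow2\n' | sort -u
```

Whatever the filter writes to stdout is what the stage presents on its output link.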

Kris~
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

Hi amsh76,

You can use the RowProcCompareWithPreviousValue routine in a stage variable or as a constraint to remove duplicates.

GoodLuck :)
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I would use three stage variables, in the order given:

(a) CurrentValue = {current column or columns concatenated}
(b) SameAsLast = IF (LastValue = CurrentValue) THEN 1 ELSE 0
(c) LastValue = CurrentValue

And in your constraint put NOT(SameAsLast)
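Outside DataStage, the same three-variable logic can be sketched with awk on sorted input (the sample data is illustrative; the awk variables mirror the stage variables above):

```shell
# Dedupe sorted input by comparing each record with the previous one.
printf 'a\na\nb\nc\nc\n' | awk '
{
    current = $0                                   # (a) CurrentValue
    same    = (NR > 1 && last == current) ? 1 : 0  # (b) SameAsLast
    if (!same) print                               # constraint: NOT(SameAsLast)
    last = current                                 # (c) LastValue
}'
```

The order matters for the same reason it does with stage variables: LastValue must be updated only after the comparison has been made.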
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Note: sort -u as such performs a full-row comparison.
Sunshine2323
Charter Member
Posts: 130
Joined: Mon Sep 06, 2004 3:05 am
Location: Dubai,UAE

Post by Sunshine2323 »

Hi,

Please refer to the below post for more answers

viewtopic.php?t=92746&highlight=duplicate

Hope this helps :)
Warm Regards,
Amruta Bandekar

If A equals success, then the formula is: A = X + Y + Z. X is work. Y is play. Z is keep your mouth shut.
--Albert Einstein
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Post by kris »

Sainath.Srinivasan wrote: Note: sort -u as such performs a full-row comparison.
We can specify positions as well and dedupe accordingly.

Example on a fixed-width file: sort on two keys in priority order, one being from position 45 to 57 and the other from position 1 to 2.

Code:

 sort -u +0.44 -0.57 +0.0 -0.3
Kris~
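The +pos/-pos form above is the historical zero-based key syntax; POSIX sort spells keys with -k and one-based field/character positions (so a start offset like +0.44 roughly corresponds to a key starting at -k1.45). A small runnable sketch with a made-up fixed-width file, deduping on characters 1-2 only:

```shell
# Made-up fixed-width records; the first two share characters 1-2 ("AB").
printf 'AB-first\nAB-second\nCD-third\n' > /tmp/fixed.txt

# POSIX key syntax: -kF.C,F.C with 1-based fields and characters.
# With -u, lines whose keys compare equal count as duplicates, so
# only one of the two AB records survives.
sort -u -k1.1,1.2 /tmp/fixed.txt
```

Which of the equal-keyed lines is kept is implementation-defined, so if first-versus-last matters, dedupe with the stage-variable approach instead.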