how to remove the duplicate records
Hi,
How can I remove duplicate records using a Sequential File stage component?
My flow is like this:
Sequential stage ---> Transformer ---> Sequential stage/Oracle OCI
Suppose I am using a flat file as a source and it has some duplicate records. I need to remove those duplicate records in the Transformer stage and insert the clean records into the target file or target table.
I need your help. Please give me some idea of how to solve this problem. It is very urgent for me.
Regards
Prithivi
Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove duplicate records for you.
If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one, and not pass it on to the subsequent stage if it is a duplicate.
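As an illustration of this sort-then-compare approach, here is a minimal UNIX sketch of the same idea outside DataStage (file names and sample values are made up; awk's `prev` variable plays the role of the stage variable):

```shell
# Sample input with duplicates (a stand-in for the flat-file source).
printf 'cust1\ncust2\ncust1\ncust3\ncust2\n' > /tmp/input.txt

# Step 1: sort so duplicate records become adjacent.
# Step 2: emulate the transformer's stage variable - print a line only
# when it differs from the previously seen one.
sort /tmp/input.txt | awk '$0 != prev { print } { prev = $0 }'
```

This prints each distinct record once (cust1, cust2, cust3). Note that the comparison only works because the sort made duplicates adjacent first.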
ArndW wrote: Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove duplicate records for you.
If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one, and not pass it on to the subsequent stage if it is a duplicate.
prithivi -- Can you explain briefly? I have used the Sort stage and am getting the data in sorted order. After that, how can I check for duplicate records through the stage variable?
I need more information about it.
Prithivi
You already have the needed information above. Have you tried to do it yourself, or do you want someone to do it for you?
Here is one solution:
Use a filter command (the sort command) in the Sequential File stage.
IN SEQFileStage --------> Xfm --------> OUT SEQFileStage
Open IN SEQFileStage, click on the Stage tab, and check "Stage uses filter command". Now click on the Output tab and write your sort command in the filter command box.
Your sort command should be:
sort -u <positions of sort keys>
You don't have to redirect it to a new file; it will read from stdin.
This filter command will dedupe your input file, and you will write the resultant records to another file.
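A quick command-line illustration of what that `sort -u` filter does to the incoming rows (the sample rows here are made up):

```shell
# sort -u reads stdin, sorts, and emits each distinct line exactly once -
# the same dedupe the Sequential File stage's filter command performs.
printf 'rowB\nrowA\nrowB\nrowC\n' | sort -u
```

The duplicate `rowB` is dropped and the output is sorted (rowA, rowB, rowC). Without key positions, `sort -u` compares entire lines.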
Kris~
I would use three stage variables, in the order given:
(a) CurrentValue = {current column or columns concatenated}
(b) SameAsLast = IF (LastValue = CurrentValue) THEN 1 ELSE 0
(c) LastValue = CurrentValue
And in your constraint put NOT(SameAsLast)
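The same three-variable logic can be sketched outside DataStage with awk, assuming the input is already sorted on the key (the two-column CSV key used here is illustrative, not from the thread):

```shell
# Emulate the three stage variables over a concatenated two-column key.
# Input must already be sorted on that key for adjacent-compare to work.
printf 'k1,a,10\nk1,a,20\nk2,b,30\n' | awk -F, '
{
  CurrentValue = $1 "," $2                            # (a) key columns concatenated
  SameAsLast = (LastValue == CurrentValue) ? 1 : 0    # (b) compare with previous row
  if (!SameAsLast) print                              # constraint: NOT(SameAsLast)
  LastValue = CurrentValue                            # (c) remember for next row
}'
```

Only the first row of each key group survives, mirroring what the constraint `NOT(SameAsLast)` does in the Transformer.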
Hi,
Please refer to the post below for more answers:
viewtopic.php?t=92746&highlight=duplicate
Hope this helps
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Sainath.Srinivasan wrote: Note: sort -u as such performs a full-row comparison.
We can specify positions as well and dedupe accordingly.
Example on a fixed-width file: sort on two keys in priority order, one being from position 45 to 57 and the other from position 1 to 2:
Code: Select all
sort -u +0.44 -0.57 +0.0 -0.3
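The `+POS1 -POS2` notation above is the obsolete sort key syntax; modern sorts accept the equivalent `-k` form. A small sketch on made-up fixed-width rows (the positions here are chosen for the example and do not match the post above):

```shell
# -k1.4,1.5 = key is characters 4-5 of the line; -k1.1,1.2 = characters 1-2.
# With -u and explicit keys, uniqueness is judged on the keys, not the whole
# line, so rows that match on the keys collapse to one.
printf 'AAAxx1\nAAAxx1\nBBByy2\n' | sort -u -k1.4,1.5 -k1.1,1.2
```

The identical duplicate collapses, leaving one `AAAxx1` and one `BBByy2`. Be aware that when two *different* rows share the same key values, which one `sort -u` keeps is implementation-defined, so key positions should cover everything that defines a duplicate.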