how to remove the duplicate records
Hi,
How can I remove duplicate records using a Sequential File stage component?
My flow is like this:
Sequential stage ---> Transformer ---> Sequential stage/Oracle OCI
Suppose I am using a flat file as a source and it has some duplicate records. I need to remove those duplicate records in the Transformer stage and insert the clean records into the target file or target table.
I need your help. Please give me some idea of how to solve this problem. It is very urgent for me.
Regards
Prithivi
Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove duplicate records for you.
If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one, and not pass it on to the subsequent stage if it is a duplicate.
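As an illustration of this sort-then-compare approach, here is a minimal UNIX sketch of the same idea outside DataStage (file names and sample values are made up; awk's `prev` variable plays the role of the stage variable):

```shell
# Sample input with duplicates (a stand-in for the flat-file source).
printf 'cust1\ncust2\ncust1\ncust3\ncust2\n' > /tmp/input.txt

# Step 1: sort so duplicate records become adjacent.
# Step 2: emulate the transformer's stage variable - print a line only
# when it differs from the previously seen one.
sort /tmp/input.txt | awk '$0 != prev { print } { prev = $0 }'
```

This prints each distinct record once (cust1, cust2, cust3). Note that the comparison only works because the sort made duplicates adjacent first.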
ArndW wrote: Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove duplicate records for you.
If your data is sorted, then you can use a stage variable in a Transformer stage to compare the current record with the previously read one, and not pass it on to the subsequent stage if it is a duplicate.
prithivi -- Can you explain briefly? I have used the Sort stage and am getting the data in sorted order. After that, how can I check for duplicate records through the stage variable?
I need more information about it.
Prithivi
You already have the needed information above. Have you tried to do it yourself, or do you want someone to do it for you?
Here is one solution:
Use a filter command (the sort command) in the Sequential File stage.
IN SEQFileStage --------> Xfm --------> OUT SEQFileStage
Open IN SEQFileStage, click on the Stage tab, and check "Stage uses filter command". Now click on the Output tab and write your sort command in the filter command box.
Your sort command should be:
sort -u <positions of sort keys>
You don't have to redirect it to a new file; it will read from stdin.
This filter command will dedupe your input file, and you will write the resultant records to another file.
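A quick command-line illustration of what that `sort -u` filter does to the incoming rows (the sample rows here are made up):

```shell
# sort -u reads stdin, sorts, and emits each distinct line exactly once -
# the same dedupe the Sequential File stage's filter command performs.
printf 'rowB\nrowA\nrowB\nrowC\n' | sort -u
```

The duplicate `rowB` is dropped and the output is sorted (rowA, rowB, rowC). Without key positions, `sort -u` compares entire lines.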
Kris~
I would use three stage variables, in the order given:
(a) CurrentValue = {current column or columns concatenated}
(b) SameAsLast = IF (LastValue = CurrentValue) THEN 1 ELSE 0
(c) LastValue = CurrentValue
And in your constraint put NOT(SameAsLast)
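The same three-variable logic can be sketched outside DataStage with awk, assuming the input is already sorted on the key (the two-column CSV key used here is illustrative, not from the thread):

```shell
# Emulate the three stage variables over a concatenated two-column key.
# Input must already be sorted on that key for adjacent-compare to work.
printf 'k1,a,10\nk1,a,20\nk2,b,30\n' | awk -F, '
{
  CurrentValue = $1 "," $2                            # (a) key columns concatenated
  SameAsLast = (LastValue == CurrentValue) ? 1 : 0    # (b) compare with previous row
  if (!SameAsLast) print                              # constraint: NOT(SameAsLast)
  LastValue = CurrentValue                            # (c) remember for next row
}'
```

Only the first row of each key group survives, mirroring what the constraint `NOT(SameAsLast)` does in the Transformer.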
Hi,
Please refer to the post below for more answers:
viewtopic.php?t=92746&highlight=duplicate
Hope this helps
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Sainath.Srinivasan wrote: Note: sort -u as such performs a full-row comparison.
We can specify positions as well and dedupe accordingly.
Example on a fixed-width file: sort on two keys in priority order, one being from position 45 to 57 and the other from position 1 to 2:
Code: Select all
sort -u +0.44 -0.57 +0.0 -0.3
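The `+POS1 -POS2` notation above is the obsolete sort key syntax; modern sorts accept the equivalent `-k` form. A small sketch on made-up fixed-width rows (the positions here are chosen for the example and do not match the post above):

```shell
# -k1.4,1.5 = key is characters 4-5 of the line; -k1.1,1.2 = characters 1-2.
# With -u and explicit keys, uniqueness is judged on the keys, not the whole
# line, so rows that match on the keys collapse to one.
printf 'AAAxx1\nAAAxx1\nBBByy2\n' | sort -u -k1.4,1.5 -k1.1,1.2
```

The identical duplicate collapses, leaving one `AAAxx1` and one `BBByy2`. Be aware that when two *different* rows share the same key values, which one `sort -u` keeps is implementation-defined, so key positions should cover everything that defines a duplicate.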