Ignoring Duplicates while reading from Seq File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

chowdary
Participant
Posts: 38
Joined: Thu Jun 23, 2005 11:25 am

Ignoring Duplicates while reading from Seq File

Post by chowdary »

Hi all,

I have Seq file like this

ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ

While reading, if ColA has duplicate values (123 in the example), the job should read only the first row, where ColB = 'ABC', and ignore the record where ColB = 'CDE'.

Please suggest how to solve this scenario.

Thanks
Chowdary
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

First number your rows, then sort the file by ColA and RowNum. Use a stage variable to remember the first line of each ColA group, write ColA and the remembered ColB to a hashed file, then read the hashed file back and you have it.
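The recipe above can be sketched outside DataStage. The Python below is only an illustrative stand-in, with the list, dict, and sort playing the assumed roles of the input link, the hashed file, and a Sort stage:

```python
# Sketch of the hashed-file approach (Python stands in for the DataStage
# stages; row numbering, sorting, and the hashed file are simulated).
rows = [("123", "ABC"), ("234", "CDE"), ("123", "CDE"), ("456", "HIJ")]

# 1. Number the rows to preserve original order.
numbered = [(col_a, col_b, i) for i, (col_a, col_b) in enumerate(rows)]

# 2. Sort by ColA, then row number, so the first-seen row leads its group.
numbered.sort(key=lambda r: (r[0], r[2]))

# 3. Write ColA -> ColB into the "hashed file" (a dict here); setdefault
#    only writes a key that is absent, so the first row of each group wins.
hashed_file = {}
for col_a, col_b, _ in numbered:
    hashed_file.setdefault(col_a, col_b)

print(sorted(hashed_file.items()))
# -> [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```

Because `setdefault` never overwrites an existing key, the first row of each sorted ColA group survives, exactly as the remembered stage variable would ensure in the job.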

Wolfgang
chowdary
Participant
Posts: 38
Joined: Thu Jun 23, 2005 11:25 am

Post by chowdary »

Hi

Thanks for the reply. I understood part of it, but I didn't understand the part about "using a stage variable to remember the first line of each ColA element". Can you please elaborate on this?

thank you very much
WoMaWil wrote:First number your rows, then sort the file by ColA and RowNum. Use a stage variable to remember the first line of each ColA group, write ColA and the remembered ColB to a hashed file, then read the hashed file back and you have it.

Wolfgang
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Chowdary,

this subject has been handled in this forum several times before. Also, it might be a good idea at this point in time to revisit your self-assessment in your user profile.

If you create two stage variables in the following order:

Code: Select all

ColAIsTheSameAsLastRow   IF LastColA = In.ColA THEN 1 ELSE 0
LastColA                 In.ColA
then you can use the value of the first to branch your logic.
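As an illustration only (Python standing in for the Transformer here, an assumption rather than DataStage syntax), the two stage variables behave like this; the declaration order is what lets the comparison see the previous row's value:

```python
# Emulation of the two Transformer stage variables. The comparison is
# derived *before* LastColA is updated, mirroring their declared order.
def duplicate_flags(col_a_values):
    last_col_a = None          # LastColA: empty before the first row
    flags = []
    for col_a in col_a_values:
        # ColAIsTheSameAsLastRow: 1 if this row repeats the previous ColA.
        same = 1 if last_col_a == col_a else 0
        last_col_a = col_a     # only now update LastColA
        flags.append(same)
    return flags

rows = ["123", "234", "123", "456"]
print(duplicate_flags(sorted(rows)))   # -> [0, 1, 0, 0]
print(duplicate_flags(rows))           # -> [0, 0, 0, 0]
```

Note that on the unsorted input no duplicate is flagged at all, since the repeated 123 rows are not adjacent; this is why the data must be sorted on ColA first.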
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Or use the Aggregator stage: group by the key columns and use the FIRST derivation for all other attribute columns. This effectively gives you the first row, as long as you handle NULL values properly, for example by substituting a value to get non-nullable (numeric) columns through the Aggregator and then substituting back to NULL afterwards.
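A sketch of this Aggregator approach in Python (the dict, the sentinel value, and the column layout are illustrative assumptions; the FIRST derivation is emulated with `dict.setdefault`):

```python
# Group by the key column (ColA) and keep the FIRST value of every other
# column, substituting a sentinel for NULL (None) on the way in and
# restoring NULL on the way out.
SENTINEL = "<NULL>"  # assumed substitute value that never occurs in the data

rows = [("123", "ABC"), ("234", None), ("123", "CDE"), ("456", "HIJ")]

first_by_key = {}
for col_a, col_b in rows:
    value = SENTINEL if col_b is None else col_b   # NULL -> sentinel
    first_by_key.setdefault(col_a, value)          # FIRST derivation

# Substitute the sentinel back to NULL after aggregation.
result = {k: (None if v == SENTINEL else v) for k, v in first_by_key.items()}
print(sorted(result.items(), key=lambda kv: kv[0]))
# -> [('123', 'ABC'), ('234', None), ('456', 'HIJ')]
```

The round trip through the sentinel is the point of the post: a value that cannot be NULL inside the aggregation still comes out as NULL at the end.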
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi Chowdary,
You can also use the Transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the previous value:
if they are the same it returns 1, otherwise 0.

In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.colA)

With the data sorted on ColA, the output will be

ColA ColB Return_Val

123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0

To get unique records, use the constraint Return_Val = 0 on the output link.

chowdary wrote:I have Seq file like this

ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ

While reading, if ColA has duplicate values (123 in the example), only the first row, where ColB = 'ABC', should be read; the record where ColB = 'CDE' should be ignored.
Please suggest how to solve this scenario.
This routine should not be used in more than one place in a job.
See the documentation for the routine's limitations.
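For readers without the routine to hand, a hypothetical Python stand-in (an assumption for illustration, not the real implementation) shows the behavior on sorted data, including why a single shared previous value limits the routine to one call site per job:

```python
# Hypothetical stand-in for RowProcCompareWithPreviousValue. The real
# routine keeps its "previous value" in job-level state; a closure
# carries that state here. Input must already be sorted on ColA.
def make_compare_with_previous():
    previous = [None]
    def compare(value):
        same = 1 if value == previous[0] else 0   # 1 = duplicate of previous
        previous[0] = value
        return same
    return compare

rows = [("123", "ABC"), ("123", "CDE"), ("234", "CDE"), ("456", "HIJ")]

compare = make_compare_with_previous()
# Constraint on the output link: keep rows where Return_Val = 0.
unique = [(a, b) for a, b in rows if compare(a) == 0]
print(unique)  # -> [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```

Because there is only one remembered previous value, calling the routine from two places in the same job would interleave their state, which is the single-use limitation noted above.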


regards
kcs
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi Chowdary,

Before using RowProcCompareWithPreviousValue sort your data.


regards
kcs
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi,

You can load the data into a table and then select the distinct records.

This is a costly process in terms of database resources, but it can be beneficial when millions of records are present.

The solution of performing a sort and comparing with the previous value could be very slow for millions of records.

Regards
Sreeni
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sreeni,

but in this case a sort on a sequential file is going to be more efficient than loading the sequential file into a database table, doing a SELECT DISTINCT on that table, and bringing the data back out of the database. There is no possibility of the database solution being faster, whether for 10 or 10 million records.
There are certainly cases where loading data into an interim database table might be more efficient, but for the example posted in this thread a temporary DB table to remove duplicates isn't a viable alternative.
chowdary
Participant
Posts: 38
Joined: Thu Jun 23, 2005 11:25 am

Post by chowdary »

Hi all,

Thanks, guys, for all your inputs. I used the routine RowProcCompareWithPreviousValue on sorted data and it seems to be working fine; the users are validating the data now. Hopefully it will hold up.

Once again, thanks for all your suggestions.
Thanks
Chowdary
kcshankar wrote:Hi Chowdary,
You can also use the Transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the previous value:
if they are the same it returns 1, otherwise 0.

In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.colA)

With the data sorted on ColA, the output will be

ColA ColB Return_Val

123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0

To get unique records, use the constraint Return_Val = 0 on the output link.

chowdary wrote:I have Seq file like this

ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ

While reading, if ColA has duplicate values (123 in the example), only the first row, where ColB = 'ABC', should be read; the record where ColB = 'CDE' should be ignored.
Please suggest how to solve this scenario.
This routine should not be used in more than one place in a job.
See the documentation for the routine's limitations.


regards
kcs
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

I agree with this; there is no need to go to a database table here. Obviously the RowProcCompareWithPreviousValue method worked for you, and I also think the other idea, sorting and then writing to a hashed file, is a good one. The ability to have a primary key on the hashed file is all you need; nothing else that comes with a traditional database is required.
ArndW wrote:Sreeni,

but in this case a sort on a sequential file is going to be more efficient than loading the sequential file into a database table, doing a SELECT DISTINCT on that table, and bringing the data back out of the database. There is no possibility of the database solution being faster, whether for 10 or 10 million records.
There are certainly cases where loading data into an interim database table might be more efficient, but for the example posted in this thread a temporary DB table to remove duplicates isn't a viable alternative.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi ArndW,

Thanks for your inputs.

I am somehow not convinced by your reasoning that sorting data in a sequential file is faster than the database solution.

I believe that for millions of records a database sort would be faster than sorting the sequential file.

:?

Regards
Sreeni
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sreeni,

it is very easy for you to reproduce both scenarios; please do so to remove your doubts. That way we aren't dealing with theories or suppositions but with cold, hard facts.

I look forward to hearing your results.

p.s. This reminds me of my high-school chemistry teacher. I was doing lab work and told him as part of some experiment that I would melt some sodium chloride (table salt). Instead of telling me I was wrong he asked me to show him - and laughed as I melted and destroyed the test tube while the sodium chloride remained happily in crystalline form. I wouldn't have learned that lesson as well had he just explained to me about melting points.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi ArndW

Thanks for your advice. I will try this and get back to you.

Regards
Sreeni