How to retain ONE record after comparing 2 different columns

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rohit_mca2003
Participant
Posts: 41
Joined: Wed Oct 08, 2008 9:19 am

How to retain ONE record after comparing 2 different columns

Post by rohit_mca2003 »

Hi,
I have a file where I need to retain only 1 record out of 2 records where 'Col1/Party1 of record1' is equal to 'Col2/Party2 of other record'. These records may not come in sequence.

Sample:
-----------
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 100

I need to retain only 1 record (either 'Record1' or 'Record2'). There is no other common key in between these records and these records are not ordered in the file.

Thanks,
Rohit
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

is the data really big? If not, can you copy the input and then do a join stage? I think that might work, what you wold do is something like

read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key

if the data is too big for this, there may be a way to do it with less brute force.

don't forget to hash partition on 'key'.
rohit_mca2003
Participant
Posts: 41
Joined: Wed Oct 08, 2008 9:19 am

Post by rohit_mca2003 »

Thanks for the reply. I thought about this approach but it may not work. As what ever join will happen with Record1 (party1) & Record2 (party2), same join will happen with 'Record2 (party1) and Record1 (Party2).
So we will end up with 2 records again.

Thanks,
Rohit
rohit_mca2003
Participant
Posts: 41
Joined: Wed Oct 08, 2008 9:19 am

Post by rohit_mca2003 »

After certain tests, got the solution.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.

If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1

This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.

Thanks for support.
Rohit
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Your solution should work based on your sample data.

In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.

Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300

Have you tested this scenario as well, or is there no way the data would be like this?
Choose a job you love, and you will never have to work a day in your life. - Confucius
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

you can do something snarky like make the key A*B (200*300 is the same as 300*200) or sort the 2 keys (stage variables can do this for 2 values really simple) and put them in the same order every time (so that 200,300 is 200,300 no matter how they came into the functions). Multiply has a risk of other ways to get the same product (1, 60000 here) but there is probably some nifty math way that will work for your data.
Post Reply