Page 1 of 1

How to retain ONE record after comparing 2 different columns

Posted: Wed Jan 30, 2019 7:55 pm
by rohit_mca2003
Hi,
I have a file where I need to retain only 1 record out of 2 records where 'Col1/Party1 of record1' is equal to 'Col2/Party2 of other record'. These records may not come in sequence.

Sample:
-----------
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 100

I need to retain only 1 record (either 'Record1' or 'Record2'). There is no other common key in between these records and these records are not ordered in the file.

Thanks,

Posted: Thu Jan 31, 2019 3:52 am
by UCDI
is the data really big? If not, can you copy the input and then do a join stage? I think that might work, what you wold do is something like

read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key

if the data is too big for this, there may be a way to do it with less brute force.

don't forget to hash partition on 'key'.

Posted: Thu Jan 31, 2019 3:34 pm
by rohit_mca2003
Thanks for the reply. I thought about this approach but it may not work. As what ever join will happen with Record1 (party1) & Record2 (party2), same join will happen with 'Record2 (party1) and Record1 (Party2).
So we will end up with 2 records again.

Thanks,

Posted: Thu Jan 31, 2019 7:58 pm
by rohit_mca2003
After certain tests, got the solution.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.

If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1

This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.

Thanks for support.

Posted: Fri Feb 01, 2019 1:46 am
by qt_ky
Your solution should work based on your sample data.

In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.

Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300

Have you tested this scenario as well, or is there no way the data would be like this?

Posted: Wed Feb 06, 2019 8:15 am
by UCDI
you can do something snarky like make the key A*B (200*300 is the same as 300*200) or sort the 2 keys (stage variables can do this for 2 values really simple) and put them in the same order every time (so that 200,300 is 200,300 no matter how they came into the functions). Multiply has a risk of other ways to get the same product (1, 60000 here) but there is probably some nifty math way that will work for your data.