DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
This topic has been marked "Resolved."
Author Message
rohit_mca2003
Participant



Joined: 08 Oct 2008
Posts: 40

Points: 475

Post Posted: Wed Jan 30, 2019 7:55 pm Reply with quote    Back to top    

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
Additional info: De-duplication based on 2 different columns in 2 records
Hi,
I have a file where I need to retain only 1 record out of 2 records where 'Col1/Party1 of record1' is equal to 'Col2/Party2 of other record'. These records may not come in sequence.

Sample:
-----------
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 100

I need to retain only 1 record (either 'Record1' or 'Record2'). There is no other common key in between these records and these records are not ordered in the file.

Thanks,

_________________
Rohit
UCDI



Group memberships:
Premium Members

Joined: 21 Mar 2016
Posts: 355

Points: 3699

Post Posted: Thu Jan 31, 2019 3:52 am Reply with quote    Back to top    

is the data really big? If not, can you copy the input and then do a join stage? I think that might work, what you wold do is something like

read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key

if the data is too big for this, there may be a way to do it with less brute force.

don't forget to hash partition on 'key'.
Rate this response:  
Not yet rated
rohit_mca2003
Participant



Joined: 08 Oct 2008
Posts: 40

Points: 475

Post Posted: Thu Jan 31, 2019 3:34 pm Reply with quote    Back to top    

Thanks for the reply. I thought about this approach but it may not work. As what ever join will happen with Record1 (party1) & Record2 (party2), same join will happen with 'Record2 (party1) and Record1 (Party2).
So we will end up with 2 records again.

Thanks,

_________________
Rohit
Rate this response:  
Not yet rated
rohit_mca2003
Participant



Joined: 08 Oct 2008
Posts: 40

Points: 475

Post Posted: Thu Jan 31, 2019 7:58 pm Reply with quote    Back to top    

After certain tests, got the solution.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.

If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1

This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.

Thanks for support.

_________________
Rohit
Rate this response:  
Not yet rated
qt_ky



Group memberships:
Premium Members

Joined: 03 Aug 2011
Posts: 2854
Location: USA
Points: 21697

Post Posted: Fri Feb 01, 2019 1:46 am Reply with quote    Back to top    

Your solution should work based on your sample data.

In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.

Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300

Have you tested this scenario as well, or is there no way the data would be like this?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
Rate this response:  
Not yet rated
UCDI



Group memberships:
Premium Members

Joined: 21 Mar 2016
Posts: 355

Points: 3699

Post Posted: Wed Feb 06, 2019 8:15 am Reply with quote    Back to top    

you can do something snarky like make the key A*B (200*300 is the same as 300*200) or sort the 2 keys (stage variables can do this for 2 values really simple) and put them in the same order every time (so that 200,300 is 200,300 no matter how they came into the functions). Multiply has a risk of other ways to get the same product (1, 60000 here) but there is probably some nifty math way that will work for your data.
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours