Remove Duplicates


Moderators: chulett, rschirm, roy

India2000
Participant
Posts: 274
Joined: Sun Aug 22, 2010 11:07 am

Remove Duplicates

Post by India2000 »

Hi,

I have a scenario where I need to remove duplicates using the complete record, based on a Y or N indicator. A few columns in the record contain nulls. How should I partition the data?

This is what I have done: I sorted the input rows using a Sort stage with the indicator in descending order, and partitioned the data on all columns except the indicator. I then used the Remove Duplicates stage with the same partitioning, sorted on the indicator and the other key columns (text columns).

The Remove Duplicates stage is not working correctly. Sometimes it works and sometimes it doesn't. Can anyone tell me exactly where I'm going wrong?

Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You've lost me, but any time I see a "sometimes works, sometimes doesn't" complaint I have to ask: does it "always work" if you run it on a single node?
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Neither Craig nor I understand your problem description. Are you removing duplicate records based on the entire record (all columns) being duplicated? If so, what role does the indicator column play?
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

I am assuming that what he is saying is, other than the indicator column, he wants to find duplicates of the rest of the columns. If there is a Y in the indicator column of a row, it is a duplicate of the previous row in all columns other than the indicator column.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

I don't think you can sort on the indicator alone. I think you need to sort on the data columns that you expect to be identical first. And you should partition by the same (one or more of the columns that you expect to be identical, hashed).
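To make the idea concrete, here is a rough simulation in plain Python, not DataStage code. It assumes the goal is the one abc123 describes: rows count as duplicates when every column except the indicator matches, and the "Y" row should be kept when there is a choice. The function and column names are made up for illustration, and nulls are treated as equal to each other, which is also an assumption.

```python
import zlib
from collections import defaultdict


def null_safe(value):
    """Map a value to a sortable pair so that None (null) compares cleanly."""
    return (value is not None, "" if value is None else str(value))


def dedupe_partitioned(rows, key_cols, indicator_col, num_partitions=4):
    """Hash-partition on the data columns, sort inside each partition,
    then keep the first row of every key group."""
    partitions = defaultdict(list)
    for row in rows:
        # The indicator is NOT part of the partitioning key, so all copies
        # of a record land in the same partition.
        key = tuple(null_safe(row[c]) for c in key_cols)
        part_no = zlib.crc32(repr(key).encode("utf-8")) % num_partitions
        partitions[part_no].append((key, row))

    kept = []
    for part_no in sorted(partitions):
        part = partitions[part_no]
        # Data columns first, indicator last; False sorts before True,
        # so "Y" rows come ahead of "N" rows within a key group.
        part.sort(key=lambda kr: (kr[0], kr[1][indicator_col] != "Y"))
        previous_key = None
        for key, row in part:
            if key != previous_key:
                kept.append(row)
                previous_key = key
    return kept


rows = [
    {"id": "1", "name": None, "ind": "N"},
    {"id": "1", "name": None, "ind": "Y"},
    {"id": "2", "name": "a", "ind": "N"},
    {"id": "2", "name": "a", "ind": "N"},
    {"id": "3", "name": "b", "ind": "Y"},
]
result = dedupe_partitioned(rows, ["id", "name"], "ind")
```

The point of the sketch is the ordering: if the indicator leads the sort, rows with identical data columns are no longer adjacent, and a "keep first" pass over adjacent rows can miss them.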

If there are many columns you might want to do a checksum on them and hash/sort off that.
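As a sketch of the checksum idea, again in plain Python rather than anything DataStage-specific: build one digest from all the data columns and use that single value as the hash and sort key. The null marker and the field separator below are arbitrary choices for this example. Whatever tool computes the checksum, nulls need a consistent representation, otherwise rows that differ only in null versus empty string could collide or fail to match.

```python
import hashlib

NULL_MARKER = "\x00NULL"   # assumed stand-in for a null column value
FIELD_SEPARATOR = "\x1f"   # keeps ("ab", "c") distinct from ("a", "bc")


def row_checksum(row, key_cols):
    """Return an MD5 hex digest over the chosen columns of one row."""
    parts = []
    for col in key_cols:
        value = row[col]
        parts.append(NULL_MARKER if value is None else str(value))
    joined = FIELD_SEPARATOR.join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


key_cols = ["id", "name"]
with_y = {"id": "1", "name": None, "ind": "Y"}
with_n = {"id": "1", "name": None, "ind": "N"}
same_key = row_checksum(with_y, key_cols) == row_checksum(with_n, key_cols)
```

Because the indicator is left out of `key_cols`, both rows get the same checksum and would hash to the same partition and sort next to each other.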