SELECT distinct v/s Remove Duplicate Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
jerome_rajan
Premium Member
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

SELECT distinct v/s Remove Duplicate Stage

Post by jerome_rajan »

Hi,

We have a table that currently has about 60 million records and will keep growing. This table contains a lot of duplicates. To extract only the unique records from a table of such a large volume, will it be efficient to use the remove duplicates stage or should I fire a DISTINCT in the SQL query of the DB2 connector?

I am currently testing it and will post the result when the job finishes but it'll be helpful if anyone can share a prior experience.
Thanks
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ssreeni3
Participant
Posts: 29
Joined: Fri May 18, 2012 1:35 am

Post by ssreeni3 »

Hi jerome_rajan,

My opinion is:

1)Oracle DISTINCT keyword treats first record in the group as a unique record and the remaing records in that goup are duplicate records of the first record based on the value/values of key/keys used in the DISTINCT.

2)RemoveDuplicate Stage has an options Retain First,Retain Last Options.

3)RemoveDuplicate Stage Retain First Option is equivalent to Oracle Distinct.

Experts,correct me if i am wrong.

Thanks,
Srini
jerome_rajan
Premium Member
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Ssreeni,

I know what the SQL distinct does and how the remove duplicate stage functions. My question is which method would be me more efficient for a data volume of 60 million + records.
I am not bothered about which record to retain since the records are pure duplicates i.e. every field in the duplicate record has the exact same value as the original record.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
jerome_rajan
Premium Member
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Oops! Just realized I posted this in the wrong forum :( . Can the admin please move it to the right forum?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I try to remove excess data as early as possible in a job. In this case the distinct at the database level will save overhead, particularly if the DB is remote.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

jerome_rajan wrote:Oops! Just realized I posted this in the wrong forum :( . Can the admin please move it to the right forum?
You betcha. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If the basis for "distinct" is indexed in the database it will make a huge amount of sense to include DISTINCT in the query, mainly for the reason Arnd articulated, but also because the query can be resolved entirely in the index, a huge performance gain.

Relative cost (volume of distinct rows / volume of rows) should be your guideline otherwise about where to apply this constraint. Available spare capacity on the database server versus the DataStage server will also be a consideration.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply