Removing duplicates from 20 million records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Removing duplicates from 20 million records

Post by m_keerthi2005 »

Hi all,

We are facing a problem removing duplicates. We have two files, each with 10 million records. When we remove duplicates using an Aggregator stage on 3 key columns, we hit the Aggregator's memory limitation: the job aborts once the memory usage reaches 2 GB, i.e. after about 15 lakh (1.5 million) records the job gets rejected.

Could you please suggest an approach to resolve this issue?

Thanks in advance.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Re: Removing duplicates from 20 million records

Post by kris007 »

Are you using server or parallel jobs?
If server jobs: one way to bypass the Aggregator limitation is to sort the file externally (an internal sort also consumes space) and then check for duplicates within the Transformer using stage variables (see the sketch below).
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
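For example, once the data is sorted on the three key columns, the Transformer stage variables could look roughly like this (just an untested sketch; the link name in and the columns KEY1, KEY2, KEY3 are placeholders for your own link and column names):

    * Stage variables, declared in this order (they are evaluated top to bottom for each row)
    svIsDup    : If in.KEY1 = svPrevKey1 And in.KEY2 = svPrevKey2 And in.KEY3 = svPrevKey3 Then @TRUE Else @FALSE
    svPrevKey1 : in.KEY1
    svPrevKey2 : in.KEY2
    svPrevKey3 : in.KEY3

    * Constraint on the output link - only non-duplicates pass through
    svIsDup = @FALSE

Give svPrevKey1..3 initial values that cannot occur in your real data so the first row is never flagged as a duplicate. Because svIsDup is evaluated before the svPrevKey variables are updated, it compares the current row against the previous one, which is exactly what you want on sorted input.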
Kris

Where's the "Any" key?-Homer Simpson
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

Thanks for the fast reply.
We are using server jobs. Could you please tell us exactly how we need to use stage variables to get this done?
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Re: Removing duplicates from 20 million records

Post by rumu »

kris007 wrote: Are you using server or parallel jobs?
If server jobs: one way to bypass the Aggregator limitation is to sort the file externally (an internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
Hi,
If your input is a sequential file, then check the Filter option and use the /bin/sort command to sort your files, something like the command below.
If you are using a UNIX server, you can also sort them beforehand with a script.
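For instance, the filter command on the Sequential File stage could be something like this (only a sketch; it assumes a pipe-delimited file with the three key columns in positions 1 to 3, so adjust the delimiter and key positions to your layout):

    /bin/sort -t '|' -k1,3

The file contents are piped through this command, so the downstream Aggregator or Transformer receives the rows already sorted on the keys.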
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

You can also preprocess the file using the UNIX sort -u command.
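Something along these lines, again assuming a pipe-delimited file with the keys in the first three columns (the file names are placeholders):

    sort -t '|' -k1,3 -u file1.txt file2.txt > combined_nodup.txt

With -u and the -k key options together, sort keeps only the first record for each distinct key value, so the duplicates are gone before DataStage ever sees the data.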
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

We can sort the data before doing the aggregation. Would this work for 10 million records?

Any thoughts on this?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Sorting a mere 10 million records should be no problem. You might need to tell the sort command to use a different temporary directory if your /tmp is rather small. Feeding sorted data to the Aggregator stage will speed up the processing and also means the stage uses very little memory.
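For example (the temporary directory is just a placeholder for a file system with enough free space, and the delimiter and key positions need adjusting to your layout):

    sort -T /data/sorttmp -t '|' -k1,3 file1.txt > file1_sorted.txt

Remember to tell the Aggregator that its input is sorted (the sort order is specified on the input link's columns), otherwise the stage will not take advantage of the pre-sort.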
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Actually, if you are using UNIX commands then you might as well use the UNIX uniq command to remove the duplicates.
Another question: why are you using an Aggregator to remove duplicates? Why not just pass the input through a hashed file?
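A sketch of the uniq route (file names are placeholders). Note that uniq only collapses adjacent identical lines and compares the whole record, so the input must be sorted first and this only helps when the complete record, not just the three keys, defines a duplicate:

    sort file1.txt file2.txt | uniq > combined_nodup.txt

The hashed file route works because writes to a hashed file are destructive overwrites on the key, so keying it on the three columns leaves exactly one row per key combination.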
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

We have also tried a hashed file, but that was taking a lot more time than the DB query. We are using only Windows and server jobs to resolve this issue; we are not using any UNIX commands.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

QualityStage can perform single file unduplication (even using fuzzy matching criteria) as well as two-file matches and removal of duplicates therefrom using various strategies.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

We do not have QualityStage. We have to resolve this using server jobs.

Could you please throw some more light on how to resolve this issue?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

m_keerthi2005,

You have now been given some possible solutions. What more do you need? You can pre-sort your data in UNIX and then feed it to DataStage for de-duplication, or you can sort and remove the duplicates directly in the UNIX sort.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

UNIX sort will also fail with this number of records. I have tried it on AIX.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Kumar,

On AIX I have sorted many more records than that. Earlier today I actually called a UNIX sort of about 48 million records from a DataStage job.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

ArndW wrote: Kumar,

On AIX I have sorted many more records than that. Earlier today I actually called a UNIX sort of about 48 million records from a DataStage job.
Perhaps I need to check whether any parameter change is required. :?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'