Removing duplicates from 20 million records
Removing duplicates from 20 million records
Hi all,
We are facing a problem removing duplicates. We have two files, each with 10 million records. When we remove duplicates using the Aggregator stage on 3 key columns, we hit the Aggregator's memory limitation: the job aborts once its memory use reaches 2 GB, i.e. after about 1.5 million (15 lakh) records.
Could you please suggest an approach to resolve this issue?
Thanks in advance.
Re: Removing duplicates from 20 million records
Are you using server or parallel jobs?
If server jobs:
One way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables, as sketched below.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
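For the server-job route, the duplicate check after an external sort boils down to comparing each row's key with the previous row's key. Here is a minimal sketch of that same logic using UNIX sort and awk rather than Transformer stage variables; the file names, pipe delimiter, and key positions (fields 1-3) are assumptions, so adjust them to your layout:

```sh
# Sort orders the rows by the three key fields; awk then emits only the
# first row of each key group -- the same "current key vs. previous key"
# test the stage variables would perform inside a Transformer.
sort -t '|' -k1,1 -k2,2 -k3,3 file1.txt file2.txt |
awk -F '|' '{ key = $1 FS $2 FS $3; if (key != prev) print; prev = key }' > unique.txt
```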
Kris
Where's the "Any" key?-Homer Simpson
Where's the "Any" key?-Homer Simpson
Re: Removing duplicates from 20 million records
Hi,
kris007 wrote:
Are you using server or parallel jobs?
If server jobs: one way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
If your input is a sequential file, check the Filter option to use the /bin/sort command to sort your files.
If you are on a UNIX server, sort the files using a script.
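As a rough illustration, a sort invocation along these lines could serve either as the Sequential File stage's filter command or inside such a script; the delimiter, key positions, file names, and temp directory are placeholders, not values from this thread:

```sh
#!/bin/sh
# Sort on the three key fields of a pipe-delimited file. -T points sort
# at a temp directory with more free space than /tmp, which matters at
# 10 million rows per file (see the next post).
/bin/sort -t '|' -k1,1 -k2,2 -k3,3 -T /var/tmp file1.txt > file1_sorted.txt
```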
Sorting a mere 10 million records should be no problem. You might need to tell the sort command to use a different temporary directory (its -T option) if your /tmp is sized rather small. Feeding sorted data to the Aggregator stage speeds up processing and also means the stage uses very little memory.
Actually, if you are using UNIX commands anyway, you might as well use the UNIX uniq command to remove the duplicates.
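A minimal sketch of that approach, again assuming pipe-delimited files with the keys in the first three fields (plain uniq compares whole lines, so a keyed sort -u is the safer choice when the non-key columns can differ):

```sh
# Concatenate both files, sort on the three key fields, and keep one
# line per distinct key. -u on a keyed sort retains a single row per
# key; use plain "sort | uniq" only when duplicate rows are identical
# in every column.
sort -t '|' -k1,1 -k2,2 -k3,3 -u -T /var/tmp file1.txt file2.txt > deduped.txt
```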
Another question: why are you using an Aggregator to remove duplicates? Why not just pass the input through a hashed file? Writes to a hashed file are destructive on the key, so only one row survives per set of duplicate key values.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
QualityStage can perform single file unduplication (even using fuzzy matching criteria) as well as two-file matches and removal of duplicates therefrom using various strategies.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW wrote:
Kumar, on AIX I have sorted many more records than that. I actually called a UNIX sort from a DataStage job earlier today on about 48 million records.
Perhaps I need to check whether any parameter change is required.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'