Duplicate Job Recommendation

Infosphere's Quality Product

Moderators: chulett, rschirm

iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Hello guys,

This post isn't about a problem; I'm looking for suggestions on what you think the best approach is.

We have a table of 60+ million records, and the client needs some statistics on the duplicates in this table, and later wants them removed.

There are four rules to find duplicates on this table:
1. The first two parts of the full Arabic name, nationality, year of birth, and an ID number. The first three columns reside in the same table, but the ID number is in a different table with a 1-to-many relationship: a person can have multiple IDs of different types. The first three must match, and for the ID, at least one of the person's ID records must match for the pair to count as a duplicate.
2. The first two parts of the full English name, nationality, year of birth, and mother's Arabic name.
3. The full Arabic name, nationality, year of birth, and mother's Arabic name.
4. The full English name, nationality, year of birth, and mother's English name.

The four rules above are combined with OR: if any one of them matches, the records count as duplicates. The last three are fairly straightforward, but the first is a bit challenging.
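
To make the OR logic and the 1-to-many ID rule concrete, here is a minimal Python sketch. It is not QualityStage and not how the Unduplicate stage scores records: it assumes exact matching on every column, and it assumes that matches chain (if A matches B under one rule and B matches C under another, all three land in one group). The field names (`name_ar`, `name_en`, `mother_ar`, `mother_en`, `nationality`, `birth_year`) and the helper names are made up for illustration.

```python
from collections import defaultdict

def first_two(name):
    """Return the first two whitespace-separated parts of a name."""
    return " ".join(name.split()[:2])

def candidate_keys(person, ids):
    """Yield one exact-match key per rule; rule 1 yields one key per ID."""
    nat, yob = person["nationality"], person["birth_year"]
    for id_number in ids:  # rule 1: sharing any single ID is enough
        yield ("r1", first_two(person["name_ar"]), nat, yob, id_number)
    yield ("r2", first_two(person["name_en"]), nat, yob, person["mother_ar"])
    yield ("r3", person["name_ar"], nat, yob, person["mother_ar"])
    yield ("r4", person["name_en"], nat, yob, person["mother_en"])

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def duplicate_groups(people, ids_by_person):
    """Group person keys that match under ANY of the four rules.

    people: dict of person key -> dict of columns (hypothetical layout)
    ids_by_person: dict of person key -> list of ID numbers (the 1-to-many table)
    """
    parent = {pid: pid for pid in people}
    first_seen = {}  # rule key -> first person that produced it
    for pid, person in people.items():
        for key in candidate_keys(person, ids_by_person.get(pid, [])):
            # Assumption: a missing value never counts as a match.
            if any(part in (None, "") for part in key):
                continue
            if key in first_seen:
                a, b = find(parent, pid), find(parent, first_seen[key])
                parent[a] = b  # merge the two groups
            else:
                first_seen[key] = pid
    groups = defaultdict(set)
    for pid in people:
        groups[find(parent, pid)].add(pid)
    return [g for g in groups.values() if len(g) > 1]
```

Emitting one key per ID is what handles the 1-to-many case: two people collide on rule 1 as soon as they share a single ID number along with the other three columns. The size of each returned group also gives the duplicate statistics directly.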

If you had such a case, what approach would you take to produce the report using QualityStage?

I would highly appreciate it if you shared your thoughts and experience, and the steps you would take.

Thank you for any input. I'm still finding my way in QS after years of working with DS.

-Issam
Not only thoughts, but a little bit of experience.
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Well, since no one has commented on this, I think I'll go with my own logic :)

Steps:
1. Prepare test data. If you have millions of records, it is a good idea to generate a sample of around 25,000 or more; the larger the sample, the slower the testing.
2. Run a Frequency job on this sample data.
3. Create a Specification using Match Designer, and choose Unduplicate Dependent as the match type.
4. Configure the Test Environment and test against your sample data and frequencies. Write down the scores that best match your expectations. You may need to spend a good amount of time on this until you figure out the best method and scores.
5. Run a Frequency job on your whole data set.
6. Create a new job that has the Unduplicate stage, and attach it to the Specification created above.
7. Load your results (the ones you need).

That's generally speaking.

Hope this also helps QS newbies like myself.