Cutoff values for QualityStage match

marcelo_almeida
Premium Member
Posts: 46
Joined: Wed Jun 28, 2006 9:54 am
Location: Brasília - Brazil

Post by marcelo_almeida »

I implemented a match process in QualityStage v8.7 and have run into a problem, described below.

When I process a source of 2900 records, the 3 records belonging to one person, which share an identical national identity number, receive the values 'MP', 'DA' and 'DA' in the qsMatchType field and '36.54', '23.2' and '28.53' in the qsMatchWeight field, respectively.
In this case I used the following cutoff values: Match=20 and Clerical=10.
This behavior is correct and expected.

However, I decided to examine this particular person and added a filter on the source so that only that person's records were processed, which reduced my source from 2900 records to 3. With this input, the three records were categorized as residuals, even though they belong to the same person.
I changed the cutoff values to Match=99 and Clerical=0 and they still remained residuals.

So I enlarged my source by adding 26 more, different people.
With the original cutoff values (Match=20 and Clerical=10) those three records continued to be residuals.
With the cutoff values Match=99 and Clerical=0 I get these values, respectively: qsMatchType ('MP', 'CP' and 'CP') and qsMatchWeight ('13.55', '6.75' and '9.47').

In my case, I process a different number of records every day: some days the volume is large and other days it is small.
How should I set the cutoff values if qsMatchWeight is influenced by the number of records in the source?

Thank you very much
Marcelo Almeida
antoniomarcelo@gmail.com
Brazil
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi Marcelo,

This sounds like normal, expected behavior if you are regenerating the frequency files every time. Try generating the frequency file with the 'full volume' and then using that file with just the three records; you should see exactly the same results.
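To illustrate the difference between a regenerated and a full-volume frequency file, here is a rough Python sketch. It is not QualityStage code: `build_frequency_table` and the `ID-...` values are hypothetical stand-ins, and the only assumption is that a frequency file records how often each value occurs relative to the records it was built from.

```python
from collections import Counter


def build_frequency_table(values):
    """Return each value's relative frequency within the given records."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical data: 2900 identity numbers, three of which belong to one person.
full_volume = ["ID-PERSON"] * 3 + [f"ID-{i}" for i in range(2897)]
filtered = ["ID-PERSON"] * 3

# Frequencies regenerated from whatever the job happens to process.
regenerated = build_frequency_table(filtered)
# Frequencies generated once from the full volume and reused afterwards.
frozen = build_frequency_table(full_volume)

print(regenerated["ID-PERSON"])  # the value looks universal in the 3-record run
print(frozen["ID-PERSON"])       # the value looks rare against the full volume
```

With the regenerated table the shared identity number appears in every record, so agreeing on it says nothing; with the full-volume table the same value is rare, whichever subset is being matched.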

By the way, you are using the Match Designer, right? You get much better visibility into the matches, and a lot more control over what you see...
Regards,
Robert
marcelo_almeida
Premium Member
Posts: 46
Joined: Wed Jun 28, 2006 9:54 am
Location: Brasília - Brazil

Post by marcelo_almeida »

Hi Robert,

Yes, I am using the Match Designer.
I tried what you suggested and it worked well.
How often do you think I need to update the frequency file? And must I always use the full source to do this?

And what if I start a service that does not have the full source, only those 3 records? How would I make this work?

Thank you very much
Marcelo Almeida
antoniomarcelo@gmail.com
Brazil
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi Marcelo,

The common practice is to update the full volume frequency file either every 'n' months, or after a lot of 'new' data has been added to the source. 'n' is a bit subjective, but many companies are ok with 3-6 months. 'new' generally means matching data that you have not seen before. If, for example, you are dealing with Brazil exclusively, and then add the US, you will want to regenerate your frequencies.

For your service, the common practice is to ALWAYS use the full volume frequencies. For exactly the reason you identified, the common practice is not to use the Match Frequency stage in the service job.
Regards,
Robert
marcelo_almeida
Premium Member
Posts: 46
Joined: Wed Jun 28, 2006 9:54 am
Location: Brasília - Brazil

Post by marcelo_almeida »

Hi Robert,

Thanks for your help!

Now I understand what the procedure for updating the reference file should be, and I will do as you advised.

However, I still do not understand why it does not work with only 3 records.

But okay, this is enough for my purpose.

Would you have any additional material on good practices that you could share with me by email? I would love to study this subject further.

Thank you very much!
Marcelo Almeida
antoniomarcelo@gmail.com
Brazil
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi Marcelo,

The reason the scores are different with just 3 records is that the frequency of occurrence affects the score.
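As a rough illustration only, here is a simplified Fellegi-Sunter-style calculation in Python. It is a sketch, not the actual QualityStage implementation: it assumes a single matching column, an agreement weight of log2(m/u), a u-probability approximated by the value's relative frequency, and an invented m-probability of 0.9. The function names and the "greater than or equal" cutoff comparison are also assumptions.

```python
import math


def agreement_weight(m_prob, value_count, total_records):
    """Simplified agreement weight log2(m/u).

    u is approximated by the value's relative frequency, so a rare
    value earns a high weight and a very common value a low one.
    """
    u_prob = value_count / total_records
    return math.log2(m_prob / u_prob)


def classify(weight, match_cutoff, clerical_cutoff):
    """Place a pair weight relative to the two cutoffs (assumed inclusive)."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


M_PROB = 0.9  # assumed m-probability, chosen only for illustration

# The same identity number occurs 3 times in both runs.
weight_full = agreement_weight(M_PROB, 3, 2900)  # frequencies from 2900 records
weight_small = agreement_weight(M_PROB, 3, 3)    # frequencies from 3 records

print(round(weight_full, 2), classify(weight_full, 99, 0))
print(round(weight_small, 2), classify(weight_small, 99, 0))
```

Under these assumptions the value that is rare among 2900 records earns a clearly positive weight, while the same value in a 3-record frequency table has u = 1 and a weight of log2(0.9), which is slightly negative. A negative weight stays below even a Clerical cutoff of 0, which is consistent with the three records remaining residuals.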

Some references:
http://publib.boulder.ibm.com/infocente ... 3%68%22%20

QualityStage Redbook: http://www.redbooks.ibm.com/abstracts/sg247546.html
Regards,
Robert