Match Cutoffs and Clerical Cutoffs

InfoSphere's Quality Product


saadmirza
Participant
Posts: 76
Joined: Tue Mar 29, 2005 2:57 am

Match Cutoffs and Clerical Cutoffs

Post by saadmirza »

Hi,
Please help me understand the cutoffs. For a file of 1,000 records, how do I set the match pass cutoffs and the clerical cutoffs? The documentation is very difficult to understand.
Can anyone explain it in an easier fashion?

Thanks,
Saad Mirza
JamasE
Participant
Posts: 32
Joined: Sun Aug 31, 2003 5:52 pm

Re: Match Cutoffs and Clerical Cutoffs

Post by JamasE »

Hello,

(I'll assume for the moment that you're not asking where to set them (that's the MATCH stage) and that you know how to create a report/extract file to see what the results are.)

I find the best way to work when I get a new file is to set the match and clerical cutoffs to zero (because you can't set them lower), take a look at the resulting matches, and from there judge which matches I like. Then, depending on whether you're trying to keep false positives low or false negatives low, decide on the match cutoff appropriately.
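Not from Jamas's post, but a minimal sketch of that first look at the weights, assuming you have dumped the aggregate weights from the zero-cutoff run into a plain text file with one weight per line. The file name and format here are invented for illustration.

# Sketch: summarise the aggregate-weight distribution from a first "cutoffs at zero" run,
# so you can see where likely true matches separate from chance agreements.
from collections import Counter

def weight_histogram(path="match_weights.txt", bucket=1.0):
    """Read one aggregate weight per line and print a crude text histogram."""
    weights = [float(line) for line in open(path) if line.strip()]
    buckets = Counter(int(w // bucket) * bucket for w in weights)
    for lo in sorted(buckets):
        print(f"{lo:6.1f} - {lo + bucket:6.1f}: {'#' * buckets[lo]}")

if __name__ == "__main__":
    weight_histogram()

A long tail of low weights with a separate cluster of high weights usually suggests where the match cutoff belongs; the gap between the clusters is the natural clerical-review zone.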

As for the clerical cutoff, we tend to just set that at the same level as the match cutoff; otherwise we lose those clerical matches for the next pass. (It took us a while to realise we had forgotten that at one point.)

QS 7.5 is supposed to have a facility for reviewing the clerical review matches (which looks amazingly useful), but we haven't upgraded yet so I can't say what that facility is like.

Cheers,
Jamas
saadmirza
Participant
Posts: 76
Joined: Tue Mar 29, 2005 2:57 am

Post by saadmirza »

Yes, I am well aware of how to create the extract file and block specs, but I am really confused about what figure to put in the match cutoff. I always put 3, and I don't know what it depends on.
Does it depend on the number of fields in the block specs?
Or on the number of records you are working with?
I'm really confused.

Saad
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's more a heuristic: by working with a representative sample of data you get the cumulative weight of each record, and you inspect the duplicates (XA and DA).
The match cutoff sets a threshold above which the cumulative weight is understood to indicate a match. The clerical cutoff sets a threshold below which a DA record is understood not to match its corresponding XA record. Anything in the "grey area" in between needs "clerical review", that is, human inspection, which is the source of the name "clerical cutoff".
There are statistical models, but eyeballing a representative sample is a good starting point; the statistics are complex even if simplifying assumptions (such as the scores being distributed on a normal curve) are made.
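Not from Ray's post, but a minimal sketch of the three-way decision he describes, assuming you already have each pair's cumulative weight and treating both boundaries as inclusive (whether QualityStage treats the exact boundary value as a match is not stated here).

# Sketch of the match / clerical review / non-match decision described above.
def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify a record pair by its cumulative (aggregate) weight."""
    if weight >= match_cutoff:
        return "match"               # at or above the match cutoff: accepted automatically
    if weight >= clerical_cutoff:
        return "clerical review"     # the grey area: needs human inspection
    return "non-match"               # below the clerical cutoff: rejected

# Example: with cutoffs of 12 (match) and 6 (clerical),
# a weight of 13 is a match, 8 goes to clerical review, and 4 is a non-match.
for w in (13, 8, 4):
    print(w, classify(w, match_cutoff=12, clerical_cutoff=6))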
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saadmirza
Participant
Posts: 76
Joined: Tue Mar 29, 2005 2:57 am

Post by saadmirza »

Hi Ray,
Thanks for the simplified explanation, but then we need to know how the weights are calculated so that the cutoff can be set accordingly. If 5 records are matched, the weights shown lie in the range of 11 to 13, and my match and clerical cutoffs are set to 6. So my question is: is the weight calculation based on the number of records matched or the number of fields matched? Reading the manual, I couldn't understand what they are trying to say; they use some logarithmic calculations. But the bottom line is that, since at this point I don't know how to calculate the cutoffs, can I use a minimal figure between 3 and 5 so that I always get the proper duplicate records and no record is skipped?

Please advise, since there is not much time left before my build delivery. I will be really grateful for your response.

Thanks & Regards,
Saad Mirza
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The weights used to compare against your cutoffs are the totals of the individual match weights for the fields in the (possibly standardized) record, usually termed "aggregate weights". The individual field match weights can be positive or negative, but the aggregate weight is usually positive (otherwise the record would tend not to be in the same block). Examples are given in the manual. The algorithms used are based on some heavy-duty statistical theory (which is why it's called a probabilistic model), and aren't in the public domain.

To your specific example: with aggregate weights all between 11 and 13, a match cutoff of 6 will mean that all your records are treated as matches. With a match cutoff of 12, any record with an aggregate weight above 12 is treated as a match and anything below is treated as a non-match, though the clerical cutoff specifies whether it falls in the "grey area" or is a definite non-match.
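An illustrative aside rather than QualityStage internals: the field weights below are invented numbers, but they show how the aggregate weight is just the sum of positive and negative field weights, and how Ray's example cutoffs of 12 and 6 would treat the result.

# Sketch: aggregate weight as the sum of individual field match weights (invented values).
field_weights = {
    "surname":       5.2,   # agreed on a fairly selective field
    "date_of_birth": 4.6,   # agreed on a very selective field
    "postcode":      3.1,   # agreed
    "gender":       -0.9,   # disagreed, so it contributes a negative weight
}

aggregate = sum(field_weights.values())   # 12.0 in this made-up example
print(f"aggregate weight = {aggregate:.1f}")

match_cutoff, clerical_cutoff = 12, 6
if aggregate >= match_cutoff:
    print("treated as a match")
elif aggregate >= clerical_cutoff:
    print("grey area: clerical review")
else:
    print("treated as a non-match")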
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saadmirza
Participant
Posts: 76
Joined: Tue Mar 29, 2005 2:57 am

Post by saadmirza »

Hi Ray,
I tried increasing the cutoffs, making them greater than what appears in the RPT file, but the record still gets identified as a duplicate.
My initial RPT file showed a weight of 21.94 for a duplicate, so I modified my pass and set the cutoffs to 22.

After running the job, the RPT file shows the weight as 11 for the same records. Why do the weights differ? Do these weights change based on some algorithm on each run? I am really confused. My requirement is that if even a slight match is found, the records should be treated as a match; they should not slip past my pass.

Thanks anyway, Ray, for your simplified explanations.
Saad
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can always make multiple passes (this is what the Pass column in the report indicates). Each pass can refine the criteria of the earlier one.

I don't believe that the exact algorithms are published. The numbers are generated for each field, and represent the "information content".

So, if a match is found in a field with very many distinct values, it is given a much higher field match weight. If a match is found in a field with very few distinct values, that field gets a lower weight even though the values might be identical, on the basis that it is not as selective: it doesn't contribute as much "information" towards the "record is a match" decision. The weight that you see, and against which the cutoffs are applied, is the aggregate of all the field weights in the record.
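As noted, the exact algorithm isn't published; the sketch below is not QualityStage's formula but the standard Fellegi-Sunter formulation of probabilistic matching, in which a field's agreement weight is roughly log2(m/u), where m is the probability the field agrees for true matches and u is the probability it agrees by chance (which shrinks as the field has more distinct values). The m and u values here are invented.

# Sketch of Fellegi-Sunter style field weights (illustrative, not QualityStage's exact algorithm).
import math

def field_weights(m: float, u: float):
    """Agreement and disagreement weights for one field.
    m: probability the field agrees when the records truly match.
    u: probability the field agrees by chance (roughly 1 / number of distinct values)."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# A selective field (many distinct values, so small u) earns a large agreement weight;
# a field like gender (u around 0.5) earns only a small one.
print("date_of_birth:", field_weights(m=0.95, u=0.001))
print("gender:       ", field_weights(m=0.98, u=0.5))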

The weight on the XA record is a measure of how likely it is, based on the number of distinct values in each of its fields, that a match found against it will be a true match.

But it IS a probabilistic model. Have you asked your support provider whether any technical documentation is available? And, if so, with what result?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.