DSXchange: DataStage and IBM Websphere Data Integration Forum
dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Tue Jul 26, 2016 5:03 am

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
Hi All,

I have set up the tutorial jobs for the Match specification and am trying to understand the rules. I have a few clarifications.

1) What is the significance of param1? For first_name it is 850 and for last_name 800, and a few columns have no param specified. On what basis should it be set?

2) The lowest weight for the set of sample data is 14 and the highest is 50. The cutoff value for both match and clerical is set to 12. After verifying that the lowest-weight pair is a valid match, can we set the cutoff to a value just below that lowest weight?

Also, how does a cutoff value set for one specific file work for data from other source systems?

3) On a web-service call, how do we set up the frequency file? For every array in the call, do we need to create the corresponding Match Frequency file for the data before the match specification stage?

Though it depends on the data and requires iterative analysis, any inputs on the above to get started would be of great help :-)

Thanks.
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 517
Location: Melbourne
Points: 3850

Posted: Tue Jul 26, 2016 6:34 pm

Ok, it's been a little while, but I'll have a crack:

1 - param1 is a parameter for the type of comparison you are using, and different comparisons require a different number of parameters (either 0, 1 or 2).
In the case of the 850 and 800, they would be for one of the uncert functions, and tell the function how strict it needs to be when determining whether the names match or not. 850 is the tightest, and 800 is not as tight but still quite strict.
Look to the doco for what each comparison method requires.
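To make the thresholding idea concrete, here is a minimal sketch. It is not the real uncert comparator (QualityStage uses its own string comparator internally); `difflib.SequenceMatcher` merely stands in for "a similarity score scaled to 0-900", so you can see why 850 is stricter than 800:

```python
from difflib import SequenceMatcher

def uncert_like(a: str, b: str, param1: int) -> bool:
    """Stand-in for an uncert-style comparison: treat two strings as
    agreeing when their similarity, scaled to 0-900, meets param1.
    (SequenceMatcher is only an illustration of the thresholding;
    the real comparator is different.)"""
    score = int(SequenceMatcher(None, a.upper(), b.upper()).ratio() * 900)
    return score >= param1

# A small spelling variation can pass the looser 800 threshold
# yet fail the tighter 850 one:
print(uncert_like("JOHNSON", "JOHNSTON", 800))  # True
print(uncert_like("JOHNSON", "JOHNSTON", 850))  # False
```

The same pair of strings agreeing at 800 but not at 850 is exactly the "strictness" difference between the two param1 values in the tutorial spec.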

2 - The cutoff values are specific to each match pass. If you are matching on 3 fields then 20 might be quite a strong match, but for another match pass that compares 7 fields, 20 might not even be a match.

As for setting the cutoff, you review the matches and clericals and poke around the non-matches to understand what looks like a reliable match and what doesn't.
Set the match cutoff with the idea that anything above it is a match that you're confident with (this will change depending on what you're using the matching for).
If you don't get consistent bands of scores that line up with what you consider a match, tweak the individual field comparison scores to shape the overall score and create clearer bands.
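The composite weights being banded like this comes from the Fellegi-Sunter model that probabilistic matchers of this kind are based on: each field contributes an agreement weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)), and the pair's score is the sum. A small sketch with illustrative m/u values (not taken from any real specification):

```python
import math

def field_weights(m: float, u: float):
    """Fellegi-Sunter weights for one field:
    m = P(field agrees | records are a true match)
    u = P(field agrees | records are a non-match)."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Illustrative probabilities only:
fields = {"first_name": (0.95, 0.02),
          "last_name":  (0.97, 0.01),
          "postcode":   (0.90, 0.05)}

# Composite weight for a pair where all three fields agree; the
# match/clerical cutoffs are thresholds on this composite score.
composite = sum(field_weights(m, u)[0] for m, u in fields.values())
```

This also shows why a cutoff of 20 means different things for different passes: a pass comparing 7 fields simply has more terms in the sum than one comparing 3.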

These scores tend to be specific to the data you're matching. Data from other systems may be more or less populated, consistent, etc., and that will have an effect on what your matches look like.

3 - I would have thought that the frequency file relates to the data you are matching against, and how often you regenerate the frequency files depends on how volatile that data is.
Regardless, you would regenerate it separately from the match call itself.
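Conceptually, generating the frequency data is just counting how often each value occurs in the reference data, which is why it is done separately and only refreshed when that data changes. A toy illustration (the real Match Frequency stage output format is different; this only shows the idea that rare values carry more evidence):

```python
from collections import Counter

# Counting value frequencies over the reference data.
reference_surnames = ["SMITH", "SMITH", "SMITH", "COOPER", "FENIMORE"]
freq = Counter(reference_surnames)

# A rare value carries more evidence: two records agreeing on
# FENIMORE is far less likely by chance than two agreeing on SMITH,
# so the rare agreement earns a higher weight.
total = len(reference_surnames)
u_prob = {value: count / total for value, count in freq.items()}
```

In other words, the frequency file lets the matcher adjust its chance-agreement probability per value rather than using one flat figure per field.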
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54372
Location: Sydney, Australia
Points: 294928

Posted: Wed Jul 27, 2016 12:00 am

On a web service call, you would quite likely use a two-file match. Yes, you would need to generate frequencies for the tiny number of rows arriving from the web client, but that will not take very long ...

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Wed Jul 27, 2016 11:56 am

Thanks for the details, they helped a lot with my understanding!

Regarding the frequency file generation, I'm a bit confused, as we were just advised to keep a default file.

1. The default file is definitely not built from the entire source system data; it is generated for just one particular system. In that case, won't it affect the match specification output for other systems?
"For every match input data set, its corresponding frequency file should be generated" - is that not true?


Thanks
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54372
Location: Sydney, Australia
Points: 294928

Posted: Wed Jul 27, 2016 3:31 pm

The key word is "should". QualityStage will work without accurate match frequencies, but the probabilities that it uses will not be as accurate. You can certainly generate a standard set of frequencies ...

dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Thu Aug 11, 2016 4:30 am

Thanks Ray!

In the tutorial example for the two-source reference match, after running the passes, the reference data and the source data records all fall under MB.

In a single-source unduplicate match, records are flagged as match and duplicate within the set.

But does all the source data being MB mean that all the candidate records match the transaction record? Are they not duplicates?

In the example below, even the records with a missing address, though they got a low weight, are of type MB
(maybe the specification is working like that).

But in a two-source reference match, when does a record fall under the duplicate type?

What do match (MB) and duplicate (DB) mean here?




Code:
      MA   T   TC001-T001   James   Fenimore   Cooper   1L Mohicans Run   Apartment 19C      Morgantown   WV   26505-0019   US
300   MB   C   TC001-C001   James   Fenimore   Cooper   1L Mohicans Run   Apartment 19C      Morgantown   WV   26505-0019   US
232   MB   C   TC001-C002   James   Fenimore   Cooper   P.O. Box 19C         Morgantown   WV   26505   US
232   MB   C   TC001-C003   James   Fenimore   Cooper            Morgantown   WV   26505   US
200   MB   C   TC001-C004   James   Fenimore   Cooper                     
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Mon Aug 15, 2016 6:21 am

Many questions in the same thread, but let me try to help...

Your last example does not have any DB records, nor does it show the critical information needed to answer your matching question. My personal common practice is to output set, type, pass, weight, and the original data. You can then sort to see things in the right order. For two-file matches, you have both the reference and the data on the same line.

In a two-file match, you will see DB records when you have selected 'many-to-one' match type. Please see: http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.qs.ug.doc/topics/c_About_record_linkage.html
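As one plausible reading of that duplicate handling, here is a hypothetical sketch: when several data records link to the same reference record and duplicate flagging is in effect, the best-scoring pair is kept as the match and the remaining pairs are flagged as duplicates. The IDs below echo the tutorial sample; the flagging logic itself is a simplification, not the product's actual algorithm:

```python
# (data_id, reference_id, weight) pairs after scoring; IDs are from
# the tutorial sample, weights illustrative.
pairs = [
    ("TC001-C001", "TC001-T001", 300),
    ("TC001-C002", "TC001-T001", 232),
    ("TC001-C004", "TC001-T001", 200),
]

# Group candidate data records by the reference record they hit.
by_ref = {}
for data_id, ref_id, weight in pairs:
    by_ref.setdefault(ref_id, []).append((weight, data_id))

# Best-scoring candidate becomes the match; the rest become duplicates.
flags = {}
for ref_id, cands in by_ref.items():
    cands.sort(reverse=True)      # highest weight first
    flags[cands[0][1]] = "M"      # match (MA/MB in the report)
    for _, data_id in cands[1:]:
        flags[data_id] = "D"      # duplicate (DA/DB in the report)
```

Under this reading, dj's sample output shows all MB rows because the pass in question was not flagging lower-scoring candidates as duplicates.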

Another word about frequencies for real-time jobs: my common practice is to never generate the frequencies from the incoming data - it is typically one record. Instead, use the frequencies from the last full-volume run/generation. This will give you a more accurate match.

_________________
Regards,
Robert