Qualitystage Reference Match weights

mmaric · Post by **mmaric** » Fri Apr 05, 2013 12:22 pm

Hi all,
I am in need of a little help. I appologize if this topic was alredy disussed I did search but could not find my answer.
We are trying to match incomming data to our master record set in database.
Database contains bussiness names along with address and phone information.
We have designed mathing specification with 2 passes , First pass does blocking on the phone number and second pass blocks on FSA (part of the postal code ) and custom created addresskey. All works as expected but with some issues. Lot of small companies often share same address (industrial part of the city .. ) and may has some similarity in the name,
For example Canada Foods INC would match Canada Records with small agreement weight,
All data is standardized prior matching and proper match frequiency is created. My question is how to favor on coulum more than otehers in composite weight. I wuld like name to contribute more to the overal weight then address data. We added override for the name dissagreement but in this case it does not help because name is matching to certain degree.
Would scaled override do good job or shoud I use total override?
Second pass matche on all standardized address coulumns and standardized name match.
Thanks all, Marinko

ray.wurlod · Post by **ray.wurlod** » Fri Apr 05, 2013 3:04 pm

I find it easier to explain scaled (multiplier) weight overrides to the business. An override greater than 1 adds influence for that column in the composite weight; an override less than 1 reduces that column's influence.

mmaric · Post by **mmaric** » Mon Apr 08, 2013 6:21 am

Thank Ray, Can not see your post yet (just paid my membership few minutes ago). Here are some other considerations I have.
I did separate testing on name matching and name alone is getting weights between 5 and 20. Around 10 I am getting dependable match. In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking.
Leave match as is set cutoff weight high to get good matches
create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??
Anther thinking is to separate records prior to matching based on address type and have separate matches. Same matching specs can be used but with different cutoff weight.
I am looking forward to hear from you and other members that have more experience on this topic as it is pretty new to me.
Thanks

mmaric · Post by **mmaric** » Mon Apr 08, 2013 12:46 pm

Just found this:

http://www-01.ibm.com/support/docview.w ... wg21573053

Since I am using MULT_UNCERT overides will not work.

stuartjvnorton · Post by **stuartjvnorton** » Mon Apr 08, 2013 7:25 pm

mmaric wrote:In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking. Leave match as is set cutoff weight high to get good matches create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??

That can be an issue for matching standardised addresses, where the absence of "optional" fields can have actual meaning.

Ray has already discussed weighting fields based on importance. This still may not help where you can get more for a non-match than a match. eg: 2 units in the same building where the unit number doesn't match still get a higher match score than 2 houses that match exactly.

An option is to use weight overrides to impose a larger negative value for disagreement on these optional fields. Another is to take the optional fields and give them a nominal score where they are absent for both source and reference.

mmaric · Post by **mmaric** » Tue Apr 09, 2013 7:24 am

Thanks Stuart,
I have looked at defalut agreement and disagreement values and they are as bellow. I guess StreetSuffixQualifier can be removed as there is no values at all in the data. I would appreciate if you coud reccomend some overrides. Also would it help if I change uprob for fields that have few distinct values - for example box type has this distribution

PO BOX 641020
CP 23661

Default Agreement Default Dissagreement
MatchPrimaryName 20.13 -9.96
Postal 19.11 -5.64
CivicNumber 16.16 -5.64
CivicNumberSUffix 7.06 -3.31
StreetPrefixDir 3.16 -3.16
StreetPrefixType 6.58 -3.3
StreetName 16.98 -3.32
StreetSuffixType 2.14 -3.99
StreetSuffixQualifier
StreetSuffixDir 3.54 3.2
RRType 2.16 -2.99
RRValue 22.7 -6.64
BoxType 2.84 -3.12
BoxValue 18.62 -4.32
GD 1.43 -2.73
DellinstalationType 3.54 -3.2
DellInstalationName 21.66 -3.32
SiteType 4.84 -3.27
SiteValue 7.83 -3.31
CompType 0.84 -3.32
CompName 22.59 -3.64
LotType 0.84 -2.32
LotValue 6.57 -3.3
ConcType 0.84 -2.32
ConcValue 5.01 -3.28
UnitValue 8.23 -6.63

rjdickson · Post by **rjdickson** » Tue Apr 09, 2013 3:00 pm

Sorry for the late response, but..

It appears that the major problem you have is trying to account for the additional fields that can be populated in the two different address types you discuss.

May I suggest creating at least two passes: One that focuses on street address, and one that focuses on RR. The first pass would include components of street in the block (NYSIIS of street name, for example). That field would be null for RR addresses, so they would not be included. You could then focus on just street addresses in that pass. The second pass would include the RR type column, which would be null for a majority of the street addresses.

You may also want to add a pass that focuses on PO Box as well, as they are different (but similar) address constructs to RR.

It would help if you posted your blocking and matching fields for each pass.

I hope this helps!

stuartjvnorton · Post by **stuartjvnorton** » Tue Apr 09, 2013 5:30 pm

Definitely agree with Robert.

With addresses, you have couple of different sets of rules: street, RR and postal addresses.
It makes sense to split them into separate passes, so you can tune each one appropriately.

mmaric · Post by **mmaric** » Wed Apr 10, 2013 9:03 am

Thank you guys, I am leaning towards that solution myself.
Robert I was blocking on FSA - first 3 characters of postal code and custom built address key, This address key was build using phonetic streeet name appended with RR type RR value, BOX Type BOX value.
There are 4 type of addresses Street , RR, BOX, and GD .
Most of the time each one populates distinct columns in standardization but there are some that are shared among all types. I looked at distribution of values for each standardization column across address types in order to figure what columns to include into matching.
Would you recommend splitting data before matching based on address type and process each part with dedicated match specifications, or making multiple passes and use blocking to determine what gets processed at each pass?

Thanks again this was very helpful

mmaric · Post by **mmaric** » Thu Apr 11, 2013 11:09 am

I have created multiple passes one for each address type as suggested by Robert.
Blocking columns are
first pass streetname NYSIIS and FSA
second pass FSA, RR Type and RR value
third pass FSA, Box Type and Box Value

This made thinks much better. Still there are some issues to deal with meaning I need to finetune street portion of matching. At some point I have been using machname hashkey as blocking coulumn but it did not work to well due to names being reported incosistent.
Thanks again everyone