Qualitystage Reference Match weights

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Qualitystage Reference Match weights

Post by mmaric »

Hi all,
I am in need of a little help. I appologize if this topic was alredy disussed I did search but could not find my answer.
We are trying to match incomming data to our master record set in database.
Database contains bussiness names along with address and phone information.
We have designed mathing specification with 2 passes , First pass does blocking on the phone number and second pass blocks on FSA (part of the postal code ) and custom created addresskey. All works as expected but with some issues. Lot of small companies often share same address (industrial part of the city .. ) and may has some similarity in the name,
For example Canada Foods INC would match Canada Records with small agreement weight,
All data is standardized prior matching and proper match frequiency is created. My question is how to favor on coulum more than otehers in composite weight. I wuld like name to contribute more to the overal weight then address data. We added override for the name dissagreement but in this case it does not help because name is matching to certain degree.
Would scaled override do good job or shoud I use total override?
Second pass matche on all standardized address coulumns and standardized name match.
Thanks all, Marinko
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I find it easier to explain scaled (multiplier) weight overrides to the business. An override greater than 1 adds influence for that column in the composite weight; an override less than 1 reduces that column's influence.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Post by mmaric »

Thank Ray, Can not see your post yet (just paid my membership few minutes ago). Here are some other considerations I have.
I did separate testing on name matching and name alone is getting weights between 5 and 20. Around 10 I am getting dependable match. In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking.
Leave match as is set cutoff weight high to get good matches
create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??
Anther thinking is to separate records prior to matching based on address type and have separate matches. Same matching specs can be used but with different cutoff weight.
I am looking forward to hear from you and other members that have more experience on this topic as it is pretty new to me.
Thanks
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Post by mmaric »

Just found this:

http://www-01.ibm.com/support/docview.w ... wg21573053

Since I am using MULT_UNCERT overides will not work.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

mmaric wrote:In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking. Leave match as is set cutoff weight high to get good matches create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??
That can be an issue for matching standardised addresses, where the absence of "optional" fields can have actual meaning.

Ray has already discussed weighting fields based on importance. This still may not help where you can get more for a non-match than a match. eg: 2 units in the same building where the unit number doesn't match still get a higher match score than 2 houses that match exactly.

An option is to use weight overrides to impose a larger negative value for disagreement on these optional fields. Another is to take the optional fields and give them a nominal score where they are absent for both source and reference.
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Post by mmaric »

Thanks Stuart,
I have looked at defalut agreement and disagreement values and they are as bellow. I guess StreetSuffixQualifier can be removed as there is no values at all in the data. I would appreciate if you coud reccomend some overrides. Also would it help if I change uprob for fields that have few distinct values - for example box type has this distribution

PO BOX 641020
CP 23661



Default Agreement Default Dissagreement
MatchPrimaryName 20.13 -9.96
Postal 19.11 -5.64
CivicNumber 16.16 -5.64
CivicNumberSUffix 7.06 -3.31
StreetPrefixDir 3.16 -3.16
StreetPrefixType 6.58 -3.3
StreetName 16.98 -3.32
StreetSuffixType 2.14 -3.99
StreetSuffixQualifier
StreetSuffixDir 3.54 3.2
RRType 2.16 -2.99
RRValue 22.7 -6.64
BoxType 2.84 -3.12
BoxValue 18.62 -4.32
GD 1.43 -2.73
DellinstalationType 3.54 -3.2
DellInstalationName 21.66 -3.32
SiteType 4.84 -3.27
SiteValue 7.83 -3.31
CompType 0.84 -3.32
CompName 22.59 -3.64
LotType 0.84 -2.32
LotValue 6.57 -3.3
ConcType 0.84 -2.32
ConcValue 5.01 -3.28
UnitValue 8.23 -6.63
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Sorry for the late response, but..

It appears that the major problem you have is trying to account for the additional fields that can be populated in the two different address types you discuss.

May I suggest creating at least two passes: One that focuses on street address, and one that focuses on RR. The first pass would include components of street in the block (NYSIIS of street name, for example). That field would be null for RR addresses, so they would not be included. You could then focus on just street addresses in that pass. The second pass would include the RR type column, which would be null for a majority of the street addresses.

You may also want to add a pass that focuses on PO Box as well, as they are different (but similar) address constructs to RR.

It would help if you posted your blocking and matching fields for each pass.

I hope this helps!
Regards,
Robert
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Definitely agree with Robert.

With addresses, you have couple of different sets of rules: street, RR and postal addresses.
It makes sense to split them into separate passes, so you can tune each one appropriately.
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Post by mmaric »

Thank you guys, I am leaning towards that solution myself.
Robert I was blocking on FSA - first 3 characters of postal code and custom built address key, This address key was build using phonetic streeet name appended with RR type RR value, BOX Type BOX value.
There are 4 type of addresses Street , RR, BOX, and GD .
Most of the time each one populates distinct columns in standardization but there are some that are shared among all types. I looked at distribution of values for each standardization column across address types in order to figure what columns to include into matching.
Would you recommend splitting data before matching based on address type and process each part with dedicated match specifications, or making multiple passes and use blocking to determine what gets processed at each pass?

Thanks again this was very helpful
mmaric
Premium Member
Premium Member
Posts: 20
Joined: Mon Nov 30, 2009 11:44 am

Post by mmaric »

I have created multiple passes one for each address type as suggested by Robert.
Blocking columns are
first pass streetname NYSIIS and FSA
second pass FSA, RR Type and RR value
third pass FSA, Box Type and Box Value

This made thinks much better. Still there are some issues to deal with meaning I need to finetune street portion of matching. At some point I have been using machname hashkey as blocking coulumn but it did not work to well due to names being reported incosistent.
Thanks again everyone
Post Reply