DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
This topic has been marked "Resolved."
Author Message
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Fri Apr 05, 2013 12:22 pm Reply with quote    Back to top    

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
Additional info: How to favor certain column?
Hi all,
I am in need of a little help. I appologize if this topic was alredy disussed I did search but could not find my answer.
We are trying to match incomming data to our master record set in database.
Database contains bussiness names along with address and phone information.
We have designed mathing specification with 2 passes , First pass does blocking on the phone number and second pass blocks on FSA (part of the postal code ) and custom created addresskey. All works as expected but with some issues. Lot of small companies often share same address (industrial part of the city .. ) and may has some similarity in the name,
For example Canada Foods INC would match Canada Records with small agreement weight,
All data is standardized prior matching and proper match frequiency is created. My question is how to favor on coulum more than otehers in composite weight. I wuld like name to contribute more to the overal weight then address data. We added override for the name dissagreement but in this case it does not help because name is matching to certain degree.
Would scaled override do good job or shoud I use total override?
Second pass matche on all standardized address coulumns and standardized name match.
Thanks all, Marinko
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54479
Location: Sydney, Australia
Points: 295436

Post Posted: Fri Apr 05, 2013 3:04 pm Reply with quote    Back to top    

I find it easier to explain scaled (multiplier) weight overrides to the business. An override greater than 1 adds influence for that column in the composite weight; an override less than 1 reduces tha ...

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
Rate this response:  
Not yet rated
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Mon Apr 08, 2013 6:21 am Reply with quote    Back to top    

Thank Ray, Can not see your post yet (just paid my membership few minutes ago). Here are some other considerations I have.
I did separate testing on name matching and name alone is getting weights between 5 and 20. Around 10 I am getting dependable match. In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking.
Leave match as is set cutoff weight high to get good matches
create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??
Anther thinking is to separate records prior to matching based on address type and have separate matches. Same matching specs can be used but with different cutoff weight.
I am looking forward to hear from you and other members that have more experience on this topic as it is pretty new to me.
Thanks
Rate this response:  
Not yet rated
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Mon Apr 08, 2013 12:46 pm Reply with quote    Back to top    

Just found this:

http://www-01.ibm.com/support/docview.wss?uid=swg21573053

Since I am using MULT_UNCERT overides will not work.
Rate this response:  
Not yet rated
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 518
Location: Melbourne
Points: 3853

Post Posted: Mon Apr 08, 2013 7:25 pm Reply with quote    Back to top    

mmaric wrote:
In my match I am comparing all elements of standardized address. So this is a bit of issue to me as having address of RR 1 and address of 13A King Street East Unit 11B will produce considerable different composite weights due to more elements to compare in second address. This makes it harder to define cutoff weight. Here is some thinking. Leave match as is set cutoff weight high to get good matches create separate match for clerical records. Use qsmatchsetid to block and compare names only. Set cutoff to get reliable matches for the name, Consider address type??

That can be an issue for matching standardised addresses, where the absence of "optional" fields can have actual meaning.

Ray has already discussed weighting fields based on importance. This still may not help where you can get more for a non-match than a match. eg: 2 units in the same building where the unit number doesn't match still get a higher match score than 2 houses that match exactly.

An option is to use weight overrides to impose a larger negative value for disagreement on these optional fields. Another is to take the optional fields and give them a nominal score where they are absent for both source and reference.
Rate this response:  
Not yet rated
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Tue Apr 09, 2013 7:24 am Reply with quote    Back to top    

Thanks Stuart,
I have looked at defalut agreement and disagreement values and they are as bellow. I guess StreetSuffixQualifier can be removed as there is no values at all in the data. I would appreciate if you coud reccomend some overrides. Also would it help if I change uprob for fields that have few distinct values - for example box type has this distribution

PO BOX 641020
CP 23661



Default Agreement Default Dissagreement
MatchPrimaryName 20.13 -9.96
Postal 19.11 -5.64
CivicNumber 16.16 -5.64
CivicNumberSUffix 7.06 -3.31
StreetPrefixDir 3.16 -3.16
StreetPrefixType 6.58 -3.3
StreetName 16.98 -3.32
StreetSuffixType 2.14 -3.99
StreetSuffixQualifier
StreetSuffixDir 3.54 3.2
RRType 2.16 -2.99
RRValue 22.7 -6.64
BoxType 2.84 -3.12
BoxValue 18.62 -4.32
GD 1.43 -2.73
DellinstalationType 3.54 -3.2
DellInstalationName 21.66 -3.32
SiteType 4.84 -3.27
SiteValue 7.83 -3.31
CompType 0.84 -3.32
CompName 22.59 -3.64
LotType 0.84 -2.32
LotValue 6.57 -3.3
ConcType 0.84 -2.32
ConcValue 5.01 -3.28
UnitValue 8.23 -6.63
Rate this response:  
Not yet rated
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Tue Apr 09, 2013 3:00 pm Reply with quote    Back to top    

Sorry for the late response, but..

It appears that the major problem you have is trying to account for the additional fields that can be populated in the two different address types you discuss.

May I suggest creating at least two passes: One that focuses on street address, and one that focuses on RR. The first pass would include components of street in the block (NYSIIS of street name, for example). That field would be null for RR addresses, so they would not be included. You could then focus on just street addresses in that pass. The second pass would include the RR type column, which would be null for a majority of the street addresses.

You may also want to add a pass that focuses on PO Box as well, as they are different (but similar) address constructs to RR.

It would help if you posted your blocking and matching fields for each pass.

I hope this helps!

_________________
Regards,
Robert
Rate this response:  
Not yet rated
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 518
Location: Melbourne
Points: 3853

Post Posted: Tue Apr 09, 2013 5:30 pm Reply with quote    Back to top    

Definitely agree with Robert.

With addresses, you have couple of different sets of rules: street, RR and postal addresses.
It makes sense to split them into separate passes, so you can tune each one appropriately.
Rate this response:  
Not yet rated
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Wed Apr 10, 2013 9:03 am Reply with quote    Back to top    

Thank you guys, I am leaning towards that solution myself.
Robert I was blocking on FSA - first 3 characters of postal code and custom built address key, This address key was build using phonetic streeet name appended with RR type RR value, BOX Type BOX value.
There are 4 type of addresses Street , RR, BOX, and GD .
Most of the time each one populates distinct columns in standardization but there are some that are shared among all types. I looked at distribution of values for each standardization column across address types in order to figure what columns to include into matching.
Would you recommend splitting data before matching based on address type and process each part with dedicated match specifications, or making multiple passes and use blocking to determine what gets processed at each pass?

Thanks again this was very helpful
Rate this response:  
Not yet rated
mmaric



Group memberships:
Premium Members

Joined: 30 Nov 2009
Posts: 20

Points: 221

Post Posted: Thu Apr 11, 2013 11:09 am Reply with quote    Back to top    

I have created multiple passes one for each address type as suggested by Robert.
Blocking columns are
first pass streetname NYSIIS and FSA
second pass FSA, RR Type and RR value
third pass FSA, Box Type and Box Value

This made thinks much better. Still there are some issues to deal with meaning I need to finetune street portion of matching. At some point I have been using machname hashkey as blocking coulumn but it did not work to well due to names being reported incosistent.
Thanks again everyone
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours