DSXchange: DataStage and IBM Websphere Data Integration Forum
anand_chal



Posted: Mon Jul 23, 2018 9:03 am

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
I am new to the Match stage and I am trying to get some hands-on experience with it.
I started by working through the Match stage examples that IBM provides on its site.

I am trying to find out how the agreement and disagreement weights are calculated for individual fields. The documentation gives the following formulas:
Agreement score: log2(m probability / u probability)
Disagreement score: log2((1 - m probability)/(1 - u probability))
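For reference, here is a minimal Python sketch of the two formulas above, evaluated for the m/u pairs used in my match specification (0.99/0.01 and 0.9/0.01). The function names are my own, not anything from the Match stage.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

# m/u probability pairs taken from the match specification in this post
for m, u in [(0.99, 0.01), (0.9, 0.01)]:
    print(f"m={m} u={u} "
          f"agreement={agreement_weight(m, u):.6f} "
          f"disagreement={disagreement_weight(m, u):.6f}")
```

These reproduce the Agreement_weight and Disagreement_weight values in the table further down (6.629357 / -6.629357 and 6.491853 / -3.307429).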

I calculated the individual field scores manually for one master record taken from the output of a match pass over the sample data.
Because the values are for the master record, the comparison is of the record with itself, not with any other record in the file.

The table below shows the scores from my calculation next to the scores from the Match stage. The Outcome column is Match, Notmatch, or BLANK.

Field                           Comparison  m-prob  u-prob  Param1  Agreement_weight  Disagreement_weight  Outcome  Manual_score  Match_stage_score
MatchPrimaryName_USNAME         UNCERT      0.99    0.01    700     6.62935662        -6.62935662          Match    6.62935662    12.69
HouseNumber_USADDR              CHAR        0.99    0.01    NA      6.62935662        -6.62935662          Match    6.62935662    12.3
HouseNumberSuffix_USADDR        CHAR        0.9     0.01    NA      6.491853096       -3.307428525         Match    6.491853096   4.26
StreetPrefixDirectional_USADDR  CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             13.1
StreetName_USADDR               UNCERT      0.99    0.01    800     6.62935662        -6.62935662          Match    6.62935662    0
StreetSuffixDirectional_USADDR  CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             0
RuralRouteValue_USADDR          CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             0
BoxValue_USADDR                 CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             0
FloorValue_USADDR               CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             0
UnitValue_USADDR                CHAR        0.9     0.01    NA      6.491853096       -3.307428525         BLANK    0             0
BuildingName_USADDR             UNCERT      0.9     0.01    800     6.491853096       -3.307428525         BLANK    0             0
ZipCode_USAREA                  CHAR        0.99    0.01    NA      6.62935662        -6.62935662          Match    6.62935662    10.17

Composite_Weights:
Expected_score Actual_Score_from_Match_stage
33.00927958 52.52


In the table above, the last two columns compare my manual scores with the scores from the Match stage.
After adding up the individual scores, there is a considerable difference between the composite weight from my calculation and the one from the Match stage (33.009 vs 52.52).
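To make the comparison reproducible, here is a rough Python sketch of how I arrived at the expected composite weight. It assumes, as in my manual calculation, that a BLANK field contributes 0 and that the composite is the plain sum of the field scores; whether the Match stage actually works this way is exactly what I am unsure about. The row data is copied from the table in this post.

```python
import math

# (field, m, u, outcome, match_stage_score), copied from the table in this post
rows = [
    ("MatchPrimaryName_USNAME",        0.99, 0.01, "Match", 12.69),
    ("HouseNumber_USADDR",             0.99, 0.01, "Match", 12.3),
    ("HouseNumberSuffix_USADDR",       0.9,  0.01, "Match", 4.26),
    ("StreetPrefixDirectional_USADDR", 0.9,  0.01, "BLANK", 13.1),
    ("StreetName_USADDR",              0.99, 0.01, "Match", 0.0),
    ("StreetSuffixDirectional_USADDR", 0.9,  0.01, "BLANK", 0.0),
    ("RuralRouteValue_USADDR",         0.9,  0.01, "BLANK", 0.0),
    ("BoxValue_USADDR",                0.9,  0.01, "BLANK", 0.0),
    ("FloorValue_USADDR",              0.9,  0.01, "BLANK", 0.0),
    ("UnitValue_USADDR",               0.9,  0.01, "BLANK", 0.0),
    ("BuildingName_USADDR",            0.9,  0.01, "BLANK", 0.0),
    ("ZipCode_USAREA",                 0.99, 0.01, "Match", 10.17),
]

def manual_score(m: float, u: float, outcome: str) -> float:
    """Field score under my assumptions: agreement or disagreement weight, 0 for BLANK."""
    if outcome == "Match":
        return math.log2(m / u)
    if outcome == "Notmatch":
        return math.log2((1 - m) / (1 - u))
    return 0.0  # assumption: blank fields add nothing to the composite

expected = sum(manual_score(m, u, outcome) for _, m, u, outcome, _ in rows)
reported = sum(score for *_, score in rows)

print(f"expected composite (manual): {expected:.5f}")
print(f"composite from Match stage:  {reported:.2f}")
print(f"difference:                  {reported - expected:.5f}")
```

The manual sum comes to about 33.009 and the Match stage column sums to 52.52, so the gap does not come from an arithmetic slip in adding the fields.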

Am I missing anything in my calculation that would give accurate results?
What is the significance of the Param1 value in the calculation?
There is not much documentation available on the web, so I am turning to DSXchange for answers.

Any help is much appreciated. Thank you!
ray.wurlod

Posted: Mon Jul 23, 2018 9:49 pm

Param1 has different interpretations for different matching algorithms. For UNCERT, for example, Param1 is the uncertainty threshold (somewhere between 700 and 900, which you CAN find in the document ...

anand_chal



Posted: Tue Jul 24, 2018 11:47 am

Thanks for your reply, Ray.

I searched the documentation, but I still do not understand why there is such a large difference. If some other calculation is required in addition to the agreement and disagreement scores, why is it not documented anywhere in the IBM documentation?

If you could provide links, I will go through them.

Thanks again!
ray.wurlod

Posted: Wed Jul 25, 2018 1:10 am

I can't recall that it's anywhere in any of the documentation other than in the training course to which I alluded. Harald Smith has a developerWorks article on how match weights are calculated ...

All times are GMT - 6 Hours