override string similarity function

Infosphere's Quality Product


surfsup
Participant
Posts: 18
Joined: Thu Apr 23, 2009 8:43 am

Post by surfsup »

Howdy,

I know this is a long shot and probably not possible (in an IBM-approved manner), but is there any way to override the string comparison function used in classification?

E.g.
The word HERTFORDSHIRE has a tolerance of 700, but it will not match against IERTFORDSHIRE. It will happily match against misspellings in the middle of the word, though.


Cheers,
A
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi,

I doubt that IBM will change the existing algorithms as they would impact many, many existing customers.

My testing did show that they did match, but barely.

However, this is an example of something that may be fixable in Standardization. Can you verify the address? Verification would presumably change the city name to the correct name. If no verification is in play, can you standardize it to the correct spelling?

In other words, this may be a Standardization challenge rather than a Matching challenge.
Regards,
Robert
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

For example, these two would match using Reverse Soundex, even without any other standardisation.
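
To illustrate the point, here is a minimal sketch of Reverse Soundex, assuming it means standard American Soundex applied to the reversed word. This is not QualityStage code, and the product's own implementation may differ in detail; the function names are just for illustration.

```python
# Standard American Soundex letter groups (digit 1..6).
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]


def reverse_soundex(word):
    """Assumption: Reverse Soundex is Soundex of the reversed string."""
    return soundex(word[::-1])


for w in ("HERTFORDSHIRE", "IERTFORDSHIRE"):
    print(w, soundex(w), reverse_soundex(w))
```

Because the code is built from the end of the word, the differing first letter never reaches the four-character code, so both words get the same Reverse Soundex value while their ordinary Soundex values differ.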
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
surfsup
Participant
Posts: 18
Joined: Thu Apr 23, 2009 8:43 am

Post by surfsup »

Hi Rj,

This was just an example. Any number of individual errors can appear in the source, and providing for all their various combinations in the standardisation phase would be much more costly (in both time and resources) than enhancing the standardisation function. (I don't expect IBM to change the existing functionality in the product.)


Hi Ray,

Reverse Soundex on these individual words would work, but the source data contains various other errors (letters as numbers, numbers as individual letters or groups of letters, and letter replacements) within the same word at the beginning, middle and/or end.
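
To make the requirement concrete, here is a rough sketch of the kind of position-independent comparison I have in mind. It is plain Python and not anything QualityStage exposes; the `LOOKALIKES` map, the function names and the 0-to-1 score are all hypothetical, and the edit distance used is ordinary Levenshtein, not whatever the product uses internally.

```python
# Hypothetical map of digits commonly typed in place of letters.
LOOKALIKES = str.maketrans({"0": "O", "1": "I", "3": "E", "5": "S"})


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Score in [0, 1]; an error costs the same wherever it occurs."""
    a = a.upper().translate(LOOKALIKES)
    b = b.upper().translate(LOOKALIKES)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


target = "HERTFORDSHIRE"
for candidate in ("IERTFORDSHIRE", "HERTFXRDSHIRE", "HERTFORDSHIRF", "H3RTF0RDSHIRE"):
    print(candidate, similarity(target, candidate))
```

With a comparison like this, a wrong first letter scores the same as a wrong letter in the middle or at the end, and digit-for-letter swaps are folded away before the comparison.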

My first stop to manage these errors would be to augment the piece of code that QS uses for classification of tokens (and hence write fewer standardisation rules and reuse more of the existing rule sets).