Posted: Mon Sep 29, 2008 2:36 am
by ray.wurlod
QualityStage does have Soundex, but you are not using it as your blocking columns. Did you intend to? Soundex is Soundex, no matter which vendor it comes from; the algorithm is in the public domain.

NYSIIS is a more sophisticated algorithm than Soundex (primarily because it looks at more characters and does not lose the vowels), and could not reasonably be expected to generate the same results as Soundex - otherwise why bother having two algorithms at all?

Posted: Mon Sep 29, 2008 3:53 am
by marionpt
Hi,

In our version (8.0.1) there is no Soundex, we only have Reverse Soundex (RVSNDX) and NYSIIS. (I will ask IBM support about this.)

I didn't use that because some last names end with a different letter, e.g. "Fernando Gonzales" should match "Fernando Gonzalez".

I use QS to get the most accurate results, and I use Soundex for testing purposes only. Soundex results are sometimes not that accurate, e.g. "John Garcia" matches "Jane Garso".
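To make the two examples above concrete, here is a minimal sketch of the public-domain American Soundex rules in Python, plus a reverse variant that simply encodes the reversed string. The function names `soundex` and `reverse_soundex` are made up for illustration, and it is only an assumption that QualityStage's RVSNDX behaves like the reverse variant here; the product's own implementation may differ in details.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code of name (empty string if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = []
    prev = CODES.get(first, "")  # the first letter's code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")  # vowels (and Y) have no code but reset prev
        if code and code != prev:
            out.append(code)
        prev = code
    return (first + "".join(out) + "000")[:4]


def reverse_soundex(name):
    """Soundex of the reversed string (assumed stand-in for RVSNDX)."""
    return soundex(name[::-1])


for a, b in [("Gonzales", "Gonzalez"), ("Garcia", "Garso"), ("John", "Jane")]:
    print(a, b, soundex(a), soundex(b), reverse_soundex(a), reverse_soundex(b))
```

With these rules, "Gonzales" and "Gonzalez" get the same forward code but different reverse codes, because the reverse code starts with the last letter of the name. "Garcia"/"Garso" and "John"/"Jane" collide on the forward code, which is the kind of false match described above.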

In my understanding, the "Residual" output contains the unique records. But when I tested the result (using Soundex), I found several duplicates, e.g.
Customer#  FirstName  LastName  State
752        Michelle   Cortel    Texas
254        Michelle   Cortel    Texas

Even without using Soundex, I could find duplicates in the Residual output.

Thanks!

Posted: Mon Sep 29, 2008 4:19 am
by ray.wurlod
What aggregate weights did you get for these "duplicates"? How common or rare is "Michelle" in your data?
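As background to why the frequency of "Michelle" matters: in probabilistic matching of the Fellegi-Sunter kind, agreement on a field contributes roughly log2(m/u) to the aggregate weight, where m is the probability that the field agrees for a true match and u the probability that it agrees by chance. The sketch below uses that textbook formula with invented m and u values; the weights QualityStage actually reports come from its own frequency data and match specification.

```python
import math


def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (a negative number) when the field disagrees."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical probabilities, chosen only for illustration.
m = 0.9            # chance the first name agrees on a true match
u_common = 0.05    # chance agreement for a common first name
u_rare = 0.001     # chance agreement for a rare first name

print("common name agrees:", round(agreement_weight(m, u_common), 2))
print("rare name agrees:  ", round(agreement_weight(m, u_rare), 2))
print("name disagrees:    ", round(disagreement_weight(m, u_common), 2))
```

A common value earns a smaller agreement weight than a rare one, so two records that agree only on common values can stay below the match cutoff and land in the residuals.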

Posted: Fri Jul 23, 2010 7:02 am
by cristty
Are you sure it's a good idea to have all 3 columns both present for blocking and for matching?

Posted: Fri Jul 23, 2010 3:45 pm
by ray.wurlod
marionpt wrote: In our version (8.0.1) there is no Soundex, we only have Reverse Soundex (RVSNDX) and NYSIIS.
You DO have Soundex - it's just that the rule set you're using does not use it to generate any matching field. You could create a custom rule set that does use it, if that's what you want.

Posted: Mon Jul 26, 2010 11:55 pm
by stuartjvnorton
cristty wrote: Are you sure it's a good idea to have all 3 columns both present for blocking and for matching?
Yeah, seems kind of pointless.

Blocking fields should be there to weed out the junk, and matching fields are to differentiate the ones that pass the block by giving them a score.
If you match on exactly the same fields as the block, then everything that makes it through the match pass will be an exact match (on the loose key) and the only score differentiation will be because of less common sounding terms as per the frequency analysis.

Even if the blocking fields and matching fields aren't the same, you want to check against actual values (or tight keys) in the match pass, so it can score the ones that are exact (or almost) higher than the ones that just sneak through via the NYSIIS.

Exact checking on loose keys = blocking
Fuzzy checks on loose keys = minimal matching value
Fuzzy checks on exact values = better matching value
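
The split between blocking and matching can be sketched outside QualityStage in a few lines of Python. This is only an illustration under loud assumptions: `loose_key` is a crude stand-in for a phonetic key such as NYSIIS (it just takes the first three letters), `similarity` uses Python's `difflib` rather than any QualityStage comparison, and the records are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def loose_key(name):
    """Crude blocking key: first three letters (hypothetical stand-in for NYSIIS)."""
    return name.strip().upper()[:3]


def similarity(a, b):
    """Fuzzy score on the actual values, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records):
    """Blocking: only records sharing a loose key are ever compared."""
    blocks = defaultdict(list)
    for rec_id, last_name in records:
        blocks[loose_key(last_name)].append((rec_id, last_name))
    for members in blocks.values():
        yield from combinations(members, 2)


# Invented sample data: (record id, last name).
records = [(1, "Gonzales"), (2, "Gonzalez"), (3, "Gonzaga"),
           (4, "Gonzales"), (5, "Cortel")]

# Matching: score each candidate pair on the real values, best first.
scored = sorted(
    ((similarity(a[1], b[1]), a[0], b[0]) for a, b in candidate_pairs(records)),
    reverse=True,
)
for score, id_a, id_b in scored:
    print(id_a, id_b, round(score, 3))
```

The exact pair comes out on top, the near-exact "Gonzales"/"Gonzalez" pairs score a little lower, and the pairs that only share the loose key score lowest. Comparing the loose keys again in the match step would give every one of these pairs the same result.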