Convert multiple tokens in standardization

Infosphere's Quality Product

evandal
Participant
Posts: 10
Joined: Wed Nov 08, 2006 12:03 pm
Location: Montreal,QC

Post by evandal »

Hi, I'm trying to convert a string of multiple tokens to a new class. I have tried lookups and conversion, but neither really does what I need, so I'm looking for suggestions on how to proceed. I can't create a rule for each case because there are thousands, and in some cases these rules would undo other valid cases.

For example, "123 YOUNG DRIVE WEST NW". This gets tokenized as ^+DDD but I want to be able to find "YOUNG DRIVE WEST" in an exception table and convert it to the Street Name. I tried creating a class Z so that after doing a convert on this I would get ^ZD, which I could create a simple rule to handle.

So as a lookup, I tried
**| ? =@COMPLEX_NAME.TBL | **
RETYPE [2] Z

TBL contains:
"YOUNG DRIVE WEST"

Problem with this is that ? handles only untokenized words.

Next I tried

*?
CONVERT [1] @COMPLEX_NAME.TBL TKN

TBL contains:
"YOUNG DRIVE WEST" Z 800.0

That doesn't seem to work either. Does anyone have any ideas on how to handle this case?
Eric Vandal - CGI
evandal

Post by evandal »

Sorry, I meant to say "Problem with this is that ? handles only unclassified words."

So I need a class that can match multiple words, classified or not.
Eric Vandal - CGI
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

I've done something like what you're trying to do in the past, but it took 2 passes at each pattern (so you could prepare the first time, then make the test and still have control over whether to use the result or not), and it was pretty ugly.

From what I can see (and I'd love to be corrected on this), there is no nice way to do it.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi,

Let's try to approach the challenge from a different perspective....

First, I am a bit confused. You say that "123 YOUNG DRIVE WEST NW" generates a ^+DDD. I would expect (depending on the rule set) that 'DRIVE' would be a T, so the pattern would be ^+TDD. Which address rule set are you using, or do you have an override for 'DRIVE'?

Secondly, what is the problem you are trying to solve? Is it that 'WEST NW' are not being handled correctly? It might help to say what you want as an output.
Regards,
Robert
evandal

Post by evandal »

rjdickson wrote: First, I am a bit confused. You say that "123 YOUNG DRIVE WEST NW" generates a ^+DDD. I would expect (depending on the rule set) that 'DRIVE' would be a T, so the pattern would be ^+TDD.
Yes, sorry that one is a typo. It does give me ^+TDD.

As for the result I expect, I need a street name = "Young Drive West" and a direction = NW. So I am looking for a way to keep a list of exception streets that don't parse correctly under the current rules and fix them early. This is a simple example, but I have hundreds of complex ones (especially in French) where a street name like "Chemin de la cote des Neiges" gets parsed out as T++T++. With long French names and alternate spellings with hyphens (T+-+-T-+-+ or T++T-+-+), the rules get too crazy and soon begin to override other valid cases.

So that's why I'm basically looking for a way to run the address string against a lookup and reclassify these complex names to a class Z, for example. Then I just have to add a few rules to treat Z as a complex street name.

I need a class like ? but that will look at multiple words regardless of whether or not they are already classified. So I could compare T++, T-+-+ or +++ against the list and see if it is an exception street.
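
To illustrate the idea outside of the Pattern-Action language, here is a minimal Python sketch of that kind of lookup: slide a window over the tokens, ignore whatever class each word already has, and collapse any run found in an exception list into a single class Z token. The names EXCEPTION_STREETS, reclassify and pattern are hypothetical, and this is not how QualityStage works internally; it only shows the behaviour being asked for.

```python
from typing import List, Tuple

# Hypothetical exception list; in practice this would be loaded from a table.
EXCEPTION_STREETS = {
    ("YOUNG", "DRIVE", "WEST"),
    ("CHEMIN", "DE", "LA", "COTE", "DES", "NEIGES"),
}
MAX_WORDS = max(len(p) for p in EXCEPTION_STREETS)


def reclassify(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse any run of words found in EXCEPTION_STREETS into one Z token.

    tokens is a list of (word, class) pairs. Existing classes are ignored
    during the comparison, so T++ and +++ runs are treated alike.
    """
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest window first so longer street names win.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            words = tuple(w.upper() for w, _ in tokens[i:i + n])
            if words in EXCEPTION_STREETS:
                out.append((" ".join(words), "Z"))
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out


def pattern(tokens: List[Tuple[str, str]]) -> str:
    """Return the class pattern string for a token list."""
    return "".join(cls for _, cls in tokens)
```

With the address from the first post tokenized as ^+TDD, reclassify would leave three tokens (the house number, the combined street name as Z, and NW as D), which is the ^ZD pattern a single simple rule could then handle.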

Is that explanation a bit clearer?
Eric Vandal - CGI
stuartjvnorton

Post by stuartjvnorton »

I think the main issue is that CONVERT doesn't quite seem to do what you need with multiple tokens.
TKN will do each token individually (which doesn't help), but does the rest of it right.
TEMP will concatenate your tokens and check them against a concatenated lookup as per p34 of the PAR, but the result isn't permanent, so you can't do it up front and handle it cleanly later: it has to be actioned NOW.

Ideally you want something in the middle: you want the permanent change and optional retype of TKN, but with the concatenated token lookup of the TEMP.

I think you would also need to understand your boundary tokens. I haven't tried a floating ** before, which is kind of what you'd need if you wanted to do it in one magic call. Otherwise you'll need to understand your boundary markers to set limits on the **.

That said, you can do it yourself, like I said before.

You'd also want to clean up any of the punctuation etc before you tried this.
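
As a rough sketch of that cleanup step (in Python rather than Pattern-Action, with a hypothetical helper name normalize_street), hyphens and other punctuation can be turned into spaces so that alternate spellings reduce to one lookup key:

```python
import re


def normalize_street(text: str) -> str:
    """Upper-case the text and replace punctuation with single spaces."""
    # Anything that is not a word character or whitespace becomes a space.
    spaced = re.sub(r"[^\w\s]", " ", text)
    # split()/join collapses runs of whitespace left behind by the substitution.
    return " ".join(spaced.upper().split())
```

Both the hyphenated and the plain spelling of a name such as "Chemin de la cote des Neiges" then produce the same string, so the exception table only needs one entry per street.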

Might also be worthwhile to understand how the FRADDR does it and see if you can apply some of their methods.