Multiple token standardization

Infosphere's Quality Product

Moderators: chulett, rschirm

hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi,

I am aware that multiple tokens can be handled via manual programming. However, I am looking for an alternative to manual programming, as I have a very large list of multi-token values, running into the thousands, to be standardized to their corresponding values. I know it works when I have to standardize a single token to multiple words in the classification table, but how do I make it work the other way round?

Thanks!!
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi,

Can you please provide a few examples of what the input would look like, and what you would like the output to be? If the tokens you are looking for are actually part of a longer string of tokens, then please provide that context in the examples, too.

I think I understand what you are trying to do, but want to make sure so as to not assume :lol:
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi Robert - I am trying to standardize city and suburb names. To have a name handled as a single token, I have to concatenate the strings within the code and then retype the result to the classification I am looking for. The concern is that there are so many multi-token names that the patterns would run into the thousands.

For example: to identify New York I have to use the combination of "New" followed by "York", then concatenate the two tokens and retype the result to the city classification. If only I could classify "New York" to a specific classification code using the classification table, I wouldn't have to retype all the suburbs and cities with multi-word names.

Do you know any trick that can be applied here?
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Take a look at USPREP. It has a table called 'USCITIES.TBL' that is used in a way that sounds similar to your requirement. The Pattern Action Language looks for two to four tokens to look up. Cities like 'NEW YORK' and 'ALAMO HEIGHTS' and 'AVON BY THE SEA' are found by the rule set.

Basically, it has Pattern Action Language that handles 1, 2, 3, or 4 word city names based on the USCITIES table.

You should be able to use this as a technique.
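To make the idea concrete, here is a minimal sketch in Python (not Pattern Action Language, and not the actual USPREP logic) of a longest-match lookup of 1 to 4 word names against a table. The table contents, the `classify` function, and the class codes `C` and `?` are all hypothetical and only illustrate the technique:

```python
# Hypothetical stand-in for a city table such as USCITIES.TBL:
# each key is a tuple of upper-cased words, each value the standardized name.
CITY_TABLE = {
    ("NEW", "YORK"): "NEW YORK",
    ("ALAMO", "HEIGHTS"): "ALAMO HEIGHTS",
    ("AVON", "BY", "THE", "SEA"): "AVON BY THE SEA",
    ("CHICAGO",): "CHICAGO",
}
MAX_WORDS = 4  # longest name we try to match


def classify(tokens):
    """Return (value, class) pairs: 'C' for a city match, '?' otherwise."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "AVON BY THE SEA" wins over "AVON".
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            key = tuple(t.upper() for t in tokens[i:i + n])
            if key in CITY_TABLE:
                out.append((CITY_TABLE[key], "C"))
                i += n  # consume all words of the matched name
                break
        else:
            # No window matched: pass the token through unclassified.
            out.append((tokens[i], "?"))
            i += 1
    return out


print(classify("Pizza Hut New York".split()))
```

The point of the design is that one table plus one generic lookup replaces a separate hand-written pattern per city, so the list can grow to thousands of names without adding patterns.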

I hope this helps!
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Thanks Robert.

This is very close to what I was looking for but not the same.

With the help of the table I am able to validate whether the city is present or not. But I can't process it any further, for example to suppress the city name. I think I forgot to mention at the beginning that the standardization is part of organization names, where I want to suppress the city names at specific occurrences.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Are you saying you have the name of a branch office included in the company name?

If you can find it, you can move it somewhere safe and then push the rest of the name through the rest of the name parse/stan.
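A rough Python sketch of "find it, move it somewhere safe, pass the rest on" might look like the following. It assumes, purely for illustration, that a city counts as a location indicator only when it trails the name and leaves at least one other word behind; that rule, the `CITIES` set, and `split_trailing_city` are hypothetical and not taken from any rule set:

```python
# Hypothetical list of known city / suburb names (upper-cased).
CITIES = {"NEW YORK", "ALAMO HEIGHTS", "CHICAGO"}
MAX_WORDS = 4  # longest city name we try to match


def split_trailing_city(name):
    """Return (remaining_name, city); city is None when nothing is moved aside."""
    words = name.upper().split()
    # Longest trailing window first; always leave at least one word for the name.
    for n in range(min(MAX_WORDS, len(words) - 1), 0, -1):
        candidate = " ".join(words[-n:])
        if candidate in CITIES:
            return " ".join(words[:-n]), candidate
    return " ".join(words), None


print(split_trailing_city("Acme Plumbing New York"))
print(split_trailing_city("New York Bagels"))
```

Under this assumed rule, a leading city such as the one in "New York Bagels" stays in the name, while a trailing one is parked in its own field so the remainder can go through the normal name parse.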
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Yes, I am trying to standardize branches and franchises to the same name. I can't strip out all the names because some of the suburb names are part of the name itself. Hence, I need to have the names tokenized so that I can identify whether a given suburb is part of the org name or a location indicator.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

This is the real issue.
Working out *when* you should take it out is harder than working out *how* to take it out.