Multiple token standardization

hitmanthesilentassasin · Mon Nov 10, 2014 5:41 pm

Hi,

I am aware that multi-token values can be handled by hand-coding patterns. However, I am looking for an alternative to manual programming, as I have a very large list (in the thousands) of multi-token values to be standardized to their corresponding values. I know the classification table works when I have to standardize a single token to multiple words, but how can I make it work the other way round?

Thanks!!

rjdickson · Mon Nov 10, 2014 6:13 pm

Hi,

Can you please provide a few examples of what the input would look like, and what you would like the output to be? If the tokens you are looking for are actually part of a longer string of tokens, then please provide that context in the examples, too.

I think I understand what you are trying to do, but I want to make sure rather than assume.

hitmanthesilentassasin · Mon Nov 10, 2014 7:38 pm

Hi Robert - I am trying to standardize city and suburb names. To have a name handled as a single token, I have to concatenate the tokens within the code and then retype the result to the classification I am looking for. The concern is that there are so many multi-token names that the patterns would run into the thousands.

For example, to identify New York I have to match "New" followed by "York", concatenate the two tokens, and then retype the result to the city classification. If only I could classify "New York" to a specific classification code using the classification table, I would not have to retype all the suburbs and cities with multi-word names.

Do you know any trick that can be applied here?

rjdickson · Wed Nov 12, 2014 1:49 am

Take a look at USPREP. It has a table called 'USCITIES.TBL' that is used in a way that sounds similar to your requirement. The Pattern Action Language looks for two or four tokens to look up. Cities like 'NEW YORK' and 'ALAMO HEIGHTS' and 'AVON BY THE SEA' are found by the rule set.

Basically, it has Pattern Action Language that handles 1, 2, 3, or 4 word city names based on the USCITIES table.
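As a rough illustration of the lookup idea only (this is Python, not Pattern Action Language, and not the actual USPREP logic), the sketch below tries the longest candidate first, from four words down to one, against a table of known city names. The table contents, the `CITY`/`OTHER` labels, and the function name are all assumptions made for the example.

```python
# Hypothetical stand-in for a city lookup table such as USCITIES.TBL.
CITY_TABLE = {"NEW YORK", "ALAMO HEIGHTS", "AVON BY THE SEA", "AUSTIN"}
MAX_WORDS = 4  # longest city name (in words) that we try to match


def tag_cities(text, table=CITY_TABLE, max_words=MAX_WORDS):
    """Return (value, label) pairs, merging multi-word cities into one token."""
    tokens = text.upper().split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "AVON BY THE SEA" wins over "AVON".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in table:
                result.append((candidate, "CITY"))
                i += n
                break
        else:
            # No window starting here is a known city: keep the single token.
            result.append((tokens[i], "OTHER"))
            i += 1
    return result


print(tag_cities("Acme Plumbing New York"))
print(tag_cities("Avon by the Sea Bakery"))
```

The point of the table-driven approach is that adding a city means adding a table row, not writing a new pattern for it.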

You should be able to use this as a technique.

I hope this helps!

hitmanthesilentassasin · Wed Nov 12, 2014 11:55 pm

Thanks Robert.

This is very close to what I was looking for but not the same.

With the help of the table I am able to validate whether the city is present. But I can't process it any further, for example to suppress the city name. I think I failed to mention at the beginning that the standardization is applied to organization names, where I want to suppress the city names at specific occurrences.

stuartjvnorton · Thu Nov 13, 2014 4:53 pm

Are you saying you have the name of a branch office included in the company name?

If you can find it, you can move it somewhere safe and then push the rest of the name through the rest of the name parse/stan.
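A minimal sketch of that "move it somewhere safe" step, again in Python and not in Pattern Action Language. It assumes, purely for illustration, that a city only counts as a location indicator when it appears at the very end of the name; the table contents and function name are hypothetical, and real data would need a better rule for deciding when to split.

```python
# Hypothetical stand-in for a city/suburb lookup table.
CITY_TABLE = {"NEW YORK", "ALAMO HEIGHTS", "AVON BY THE SEA"}


def split_trailing_city(name, table=CITY_TABLE, max_words=4):
    """Return (remaining_name, city); city is None when nothing is moved out."""
    tokens = name.upper().split()
    # Always leave at least one token behind as the organization name.
    for n in range(min(max_words, len(tokens) - 1), 0, -1):
        candidate = " ".join(tokens[-n:])
        if candidate in table:
            return " ".join(tokens[:-n]), candidate
    return " ".join(tokens), None


print(split_trailing_city("Acme Plumbing New York"))
print(split_trailing_city("New York Bagels"))
```

Under this assumed rule, a leading city such as the one in "New York Bagels" stays in the name, and the remaining name can then go through the usual name parsing and standardization.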

hitmanthesilentassasin · Thu Nov 13, 2014 5:34 pm

Yes, I am trying to standardize branches and franchises to the same name. I can't remove all the suburb names because some of them are part of the name itself. Hence, I need to have the names tokenized so that I can identify whether a given suburb is part of the org name or a location indicator.

stuartjvnorton · Thu Nov 13, 2014 11:32 pm

This is the real issue.
Working out *when* you should take it out is harder than working out *how* to take it out.