QualityStage - How to match string with spaces

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
kenstorm
Premium Member
Premium Member
Posts: 40
Joined: Mon Oct 16, 2006 2:33 pm

QualityStage - How to match string with spaces

Post by kenstorm »

I am trying to match product descriptions to a clean list of product descriptions. How can I do this? Since everything gets broken down into tokens, I can't do a long string match. I HAVE to match this way. I can't standardize using individual tokens due to multiple companies.

Raw data: ABC company Pepsi 12 pk
DEF company Pep twelve pax
Reference: ABC company Pepsi 12 pack

Thank you for your help. It us appreciated
Regards - Ken Breetz

Keep me in the circle of trust
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Let's get you over into the dedicated QualityStage forum... there you go.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kenstorm
Premium Member
Premium Member
Posts: 40
Joined: Mon Oct 16, 2006 2:33 pm

Post by kenstorm »

Thanks Chulett!

Here is some of the sample reference data:
"SCHLAFLY OKTOBERFST 1/6","Schlafly Oktoberfest 1/6 barrel keg (5.16 gal)",800
"SCHLAFLY OKTOBERFEST 50","Schlafly Oktoberfest 50 liter keg (13.2 gal)",800
"SCHLAFLY OKTOBRFEST 4/6","Schlafly Oktoberfest 12 ounce bottle - 6 pack",800
"SCHLAFLY OKTBRFEST 12PK","Schlafly Oktoberfest 12 ounce bottle - 12 pack",800

I need to match the first field with the input raw data then return the second field from the reference data.
Regards - Ken Breetz

Keep me in the circle of trust
kenstorm
Premium Member
Premium Member
Posts: 40
Joined: Mon Oct 16, 2006 2:33 pm

Post by kenstorm »

I found a way to perform this function. By taking the space out of STRIPLIST, it doesn't break the string into separate tokens. I am able to perform a lookup and return data using threshold settings.

BEFORE
\PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|\\:;\"<>,.?/'"
STRIPLIST " "
\PRAGMA_END

AFTER
\PRAGMA_START
SEPLIST "~`!@#$%^&*()-+={}[]|:;\"<>,.?'"
STRIPLIST "~"
\PRAGMA_END
Regards - Ken Breetz

Keep me in the circle of trust
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... resolved?
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

There is not an out of the box standardization rule set for products or items, so I would:
- Develop a custom rule set to standardize the products raw input data
- Design matching specification with the help of the SME for the products (He will know more about the classification/taxonomy of.the products)
- Develop the QS standardization process to use the new custom rule set and output the standardized products definitions
- Develop the matching process to get the reference match that you are lookingfor

Normally you don't need to worry about spaces and you don't want to try to match using the entire product definition

Regards
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
kenstorm
Premium Member
Premium Member
Posts: 40
Joined: Mon Oct 16, 2006 2:33 pm

Post by kenstorm »

Thank you JRodriguez.

The challenge here is there are over 500 distributors sending product descriptions from thousands of suppliers who constantly change the unit names and amounts. One might send "12 pk" which the SME will know as 12 pack of bottles. Another distributor can send "12 pk" which will be 12 pack of cans. By including the supplier name and product name in the logic, the logic can match specifically to get the units and amounts. If I tokenize everything, I would still need logic to choose the correct rule for the supplier.

I have another question in a different post about how to write to the reference lookup after the SME approves new incoming data that has a low matching threshold. This system will need to have approval based logic since this data changes so much.

I truly appreciate your help
Regards - Ken Breetz

Keep me in the circle of trust
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

I understand the challenge, and can assure you that is a normal situation. This is where QS is a great tool to match records where there are not common keys, but as I mentioned you need a good SME for the knowledge about their data. The goal should be to define the standards, how you want the gold master data to be organized and then work your way out from the raw input data from your different vendors to your established and published standards.

A well defined taxonomy by the company or SME would be capable of uniquely identify each product, every form of package, and other qualifiers for.the products. Also it should have policies well defined for conversions like packs to units, boxes to units, etc.

The tokenization and the smart picking of the match fields and the multiple passes, including one with the vendor code, will be key to get the expected results...for the this you would need to work out with the SME . There are not sample of a similar process but you could follow a sample of how to get a master data or consolidation of financial
customer from IBM developerworks site, that should give you a good base to start

As per the low quality records that the process did not get a match, you have different alternatives depending on the tools available at your site, but commonly you would have an exception process where you would reprocess records that has been corrected. In version 11.x the Information Governance Catalog + Stewardship allow you to define business processes to properly address the low quality records by assigning a steward to correct the records and send them back as a regular input or exception file to the process. Pre version 11.x you would normally write to CSV file so you could give it to the SME to amend/correct the records. After you would normally reprocess them in your process.... There are other bits that you would need to do to set your process to learn about the bad records or tokens so you don't need to ask the SME to correct the same records every day

You could also use Information Analyzer quality rules before and after your standardization+Matching process to monitor the effectiveness of the process
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
kenstorm
Premium Member
Premium Member
Posts: 40
Joined: Mon Oct 16, 2006 2:33 pm

Post by kenstorm »

Thank you JRodriguez

I agree with your approach and everything you mentioned. The company already has the cleansed target data that all customers need to match to. My first deliverable is to create a reference table that has all the raw data and its cleansed target equivalent. So this is a many to one relationship. Many raw records to one cleansed target record. I want to pre-scrub and standardize the data before attempting to match on the clean data. So this would be changing all abbreviations to be consistent. What is the best way to do this?

Also, I need to expose these reference lookups to the business so they can adjust them from time to time. I looked into Business Glossary and didn't how that can be usable.

Once again, thank you so much for helping me. Can't tell you how much i appreciate it.
Regards - Ken Breetz

Keep me in the circle of trust
Post Reply