DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
Author Message
kenstorm



Group memberships:
Premium Members

Joined: 16 Oct 2006
Posts: 40

Points: 626

Post Posted: Fri Aug 25, 2017 11:00 am Reply with quote    Back to top    

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
I am trying to match product descriptions to a clean list of product descriptions. How can I do this? Since everything gets broken down into tokens, I can't do a long string match. I HAVE to match this way. I can't standardize using individual tokens due to multiple companies.

Raw data: ABC company Pepsi 12 pk
DEF company Pep twelve pax
Reference: ABC company Pepsi 12 pack

Thank you for your help. It us appreciated

_________________
Regards - Ken Breetz

Keep me in the circle of trust
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42097
Location: Denver, CO
Points: 216080

Post Posted: Fri Aug 25, 2017 11:36 am Reply with quote    Back to top    

Let's get you over into the dedicated QualityStage forum... there you go.

_________________
-craig

Can't keep my eyes from the circling skies
Tongue tied and twisted just an earth bound misfit, I
Rate this response:  
kenstorm



Group memberships:
Premium Members

Joined: 16 Oct 2006
Posts: 40

Points: 626

Post Posted: Fri Aug 25, 2017 12:22 pm Reply with quote    Back to top    

Thanks Chulett!

Here is some of the sample reference data:
"SCHLAFLY OKTOBERFST 1/6","Schlafly Oktoberfest 1/6 barrel keg (5.16 gal)",800
"SCHLAFLY OKTOBERFEST 50","Schlafly Oktoberfest 50 liter keg (13.2 gal)",800
"SCHLAFLY OKTOBRFEST 4/6","Schlafly Oktoberfest 12 ounce bottle - 6 pack",800
"SCHLAFLY OKTBRFEST 12PK","Schlafly Oktoberfest 12 ounce bottle - 12 pack",800

I need to match the first field with the input raw data then return the second field from the reference data.

_________________
Regards - Ken Breetz

Keep me in the circle of trust
Rate this response:  
kenstorm



Group memberships:
Premium Members

Joined: 16 Oct 2006
Posts: 40

Points: 626

Post Posted: Fri Aug 25, 2017 3:17 pm Reply with quote    Back to top    

I found a way to perform this function. By taking the space out of STRIPLIST, it doesn't break the string into separate tokens. I am able to perform a lookup and return data using threshold settings.

BEFORE
\PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|\\:;\"<>,.?/'"
STRIPLIST " "
\PRAGMA_END

AFTER
\PRAGMA_START
SEPLIST "~`!@#$%^&*()-+={}[]|:;\"<>,.?'"
STRIPLIST "~"
\PRAGMA_END

_________________
Regards - Ken Breetz

Keep me in the circle of trust
Rate this response:  
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42097
Location: Denver, CO
Points: 216080

Post Posted: Fri Aug 25, 2017 3:43 pm Reply with quote    Back to top    

So... resolved?

_________________
-craig

Can't keep my eyes from the circling skies
Tongue tied and twisted just an earth bound misfit, I
Rate this response:  
JRodriguez



Group memberships:
Premium Members

Joined: 19 Nov 2005
Posts: 396
Location: New York City
Points: 4268

Post Posted: Fri Aug 25, 2017 4:12 pm Reply with quote    Back to top    

There is not an out of the box standardization rule set for products or items, so I would:
- Develop a custom rule set to standardize the products raw input data
- Design matching specification with the help of the SME for the products (He will know more about the classification/taxonomy of.the products)
- Develop the QS standardization process to use the new custom rule set and output the standardized products definitions
- Develop the matching process to get the reference match that you are lookingfor

Normally you don't need to worry about spaces and you don't want to try to match using the entire product definition

Regards

_________________
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
Rate this response:  
kenstorm



Group memberships:
Premium Members

Joined: 16 Oct 2006
Posts: 40

Points: 626

Post Posted: Sat Aug 26, 2017 8:53 am Reply with quote    Back to top    

Thank you JRodriguez.

The challenge here is there are over 500 distributors sending product descriptions from thousands of suppliers who constantly change the unit names and amounts. One might send "12 pk" which the SME will know as 12 pack of bottles. Another distributor can send "12 pk" which will be 12 pack of cans. By including the supplier name and product name in the logic, the logic can match specifically to get the units and amounts. If I tokenize everything, I would still need logic to choose the correct rule for the supplier.

I have another question in a different post about how to write to the reference lookup after the SME approves new incoming data that has a low matching threshold. This system will need to have approval based logic since this data changes so much.

I truly appreciate your help

_________________
Regards - Ken Breetz

Keep me in the circle of trust
Rate this response:  
JRodriguez



Group memberships:
Premium Members

Joined: 19 Nov 2005
Posts: 396
Location: New York City
Points: 4268

Post Posted: Sat Aug 26, 2017 11:57 am Reply with quote    Back to top    

I understand the challenge, and can assure you that is a normal situation. This is where QS is a great tool to match records where there are not common keys, but as I mentioned you need a good SME for the knowledge about their data. The goal should be to define the standards, how you want the gold master data to be organized and then work your way out from the raw input data from your different vendors to your established and published standards.

A well defined taxonomy by the company or SME would be capable of uniquely identify each product, every form of package, and other qualifiers for.the products. Also it should have policies well defined for conversions like packs to units, boxes to units, etc.

The tokenization and the smart picking of the match fields and the multiple passes, including one with the vendor code, will be key to get the expected results...for the this you would need to work out with the SME . There are not sample of a similar process but you could follow a sample of how to get a master data or consolidation of financial
customer from IBM developerworks site, that should give you a good base to start

As per the low quality records that the process did not get a match, you have different alternatives depending on the tools available at your site, but commonly you would have an exception process where you would reprocess records that has been corrected. In version 11.x the Information Governance Catalog + Stewardship allow you to define business processes to properly address the low quality records by assigning a steward to correct the records and send them back as a regular input or exception file to the process. Pre version 11.x you would normally write to CSV file so you could give it to the SME to amend/correct the records. After you would normally reprocess them in your process.... There are other bits that you would need to do to set your process to learn about the bad records or tokens so you don't need to ask the SME to correct the same records every day

You could also use Information Analyzer quality rules before and after your standardization+Matching process to monitor the effectiveness of the process

_________________
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
Rate this response:  
kenstorm



Group memberships:
Premium Members

Joined: 16 Oct 2006
Posts: 40

Points: 626

Post Posted: Wed Sep 13, 2017 7:56 am Reply with quote    Back to top    

Thank you JRodriguez

I agree with your approach and everything you mentioned. The company already has the cleansed target data that all customers need to match to. My first deliverable is to create a reference table that has all the raw data and its cleansed target equivalent. So this is a many to one relationship. Many raw records to one cleansed target record. I want to pre-scrub and standardize the data before attempting to match on the clean data. So this would be changing all abbreviations to be consistent. What is the best way to do this?

Also, I need to expose these reference lookups to the business so they can adjust them from time to time. I looked into Business Glossary and didn't how that can be usable.

Once again, thank you so much for helping me. Can't tell you how much i appreciate it.

_________________
Regards - Ken Breetz

Keep me in the circle of trust
Rate this response:  
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours