DSXchange: DataStage and IBM Websphere Data Integration Forum
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Wed Oct 16, 2013 2:19 pm

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
Hi,

I've got some really dirty Australian address data, with lots of "added" information. The entire address is in a single VarChar field with no consistent format.

Can you suggest ideas for cleansing the data (using DataStage, with which I am quite familiar) or QualityStage? I am permitted to create a copy of the rule sets if necessary, and even to modify the pattern-action script.

Thanks,
Kevin

_________________
Kevin K Tashadow
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54431
Location: Sydney, Australia
Points: 295208

Posted: Wed Oct 16, 2013 3:18 pm

Welcome aboard.

What have you tried? How successful was it?

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne (especially seeking good business analysts)
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Wed Oct 16, 2013 6:32 pm

I have a DataStage job that splits the string at the first street type token in the data; the token and everything preceding it go into the addr bucket, while everything following the token goes into the area bucket. If no street type token exists, all of the data goes into the area bucket.
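For illustration, that split can be sketched in Python. This is only a sketch of the logic, not the actual DataStage job; the street-type token list here is a small invented sample, not the real QualityStage classification table.

```python
# Split an address at the first street-type token: the token and everything
# before it go into "addr"; the rest goes into "area". If no street-type
# token is found, the whole string goes into "area".
STREET_TYPES = {"ST", "RD", "AVE", "DR", "DRIVE", "STREET", "ROAD",
                "AVENUE", "CRES", "PDE"}  # illustrative sample only

def split_address(raw: str) -> tuple[str, str]:
    tokens = raw.upper().split()
    for i, tok in enumerate(tokens):
        if tok in STREET_TYPES:
            return " ".join(tokens[: i + 1]), " ".join(tokens[i + 1:])
    return "", " ".join(tokens)
```

Note that this naive first-match split misparses localities whose names contain a street-type word, such as ST KILDA RD, which is exactly the weakness discussed later in this thread.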

Using character investigation (as taught in the QualityStage training) I have identified over 100 input patterns for the AUAREA rule set that lead to unhandled data. In only a small proportion of these does the unhandled pattern match the input pattern. I am considering incorporating all of these into a new AREA rule set, but that may make it cumbersomely large, with an associated performance penalty.

That's why I'm looking for ideas.

_________________
Kevin K Tashadow
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3590
Location: Australia, Melbourne
Points: 28109

Posted: Wed Oct 16, 2013 10:23 pm

Well, this is what QualityStage is for: breaking string fields into tokens and standardising them. Try the QualityStage Investigate and Standardize stages. You could spend a year writing DataStage code to parse all the combinations; you need QualityStage.

_________________
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney LinkedIn
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Thu Oct 17, 2013 1:49 am

I am using QualityStage but, because there is so much extraneous information and so many different ad hoc formats, the AUAREA and AUADDR rule sets are proving quite poor at parsing out significant information.

There are some 377 different input patterns using the AUAREA rule set. Over 30% of records have a non-null unhandled pattern when investigated. About half of those have an unhandled pattern identical to the input pattern.

People doing data entry are no doubt trying to be helpful but, for example, an address like 51 ST MARKS DRIVE ST MARKS WOODSIDE (no postcode or state) is a nightmare. The LocalityName should be ST MARKS, while WOODSIDE is additional information (ExceptionData?).

There are lots of misspellings, too (for example, eleven different spellings of AUSTRALIA, not counting the ones in state names).

I guess what I'm seeking to learn is efficient ways to create pattern-action scripts, rather than trying to handle every possible input pattern.

Supplementary question - is there any way to debug PAL, for example to dump values at run time?

_________________
Kevin K Tashadow
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54431
Location: Sydney, Australia
Points: 295208

Posted: Thu Oct 17, 2013 1:52 am

Have you tried running four Standardization stages, passing the unhandled data from each to the next, perhaps adding in the postcode and state ahead of the second AUAREA standardization? Code: ...
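The cascade idea can be sketched like this. The stages below are hypothetical stand-ins for chained Standardize stages; this only illustrates the control flow of passing unhandled records from one pass to the next, not the actual (elided) job design.

```python
from typing import Callable, Optional

# Each "stage" returns a parsed result, or None if the record is unhandled.
Stage = Callable[[str], Optional[dict]]

def cascade(record: str, stages: list[Stage]) -> Optional[dict]:
    # Pass the record through each standardization stage in turn; the first
    # stage that handles it wins. Unhandled residue flows on to the next
    # stage, mirroring the chained-Standardize approach described above.
    for stage in stages:
        result = stage(record)
        if result is not None:
            return result
    return None  # still unhandled after all passes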

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne (especially seeking good business analysts)
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 518
Location: Melbourne
Points: 3853

Posted: Thu Oct 17, 2013 4:35 am

Splitting at the first T is a problem, because you'll get it wrong every time you hit something like St Kilda Rd or High St Rd.

Obviously the word investigation will be a big thing: the more you can classify, the easier your life will be in some ways. It will also show you the longer words where you can use the optional inexact comparison thresholds.
Also look at the SEPLIST and STRIPLIST. You might see some unusual punctuation in use, and letting it through so you can handle it specifically may help you out as well.
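The inexact-comparison idea can be illustrated with a simple similarity check. This is a generic sketch, not QualityStage's actual comparison routine; the length cutoff and threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def close_enough(token: str, classified: str, threshold: float = 0.85) -> bool:
    # Short tokens must match exactly; longer classified words tolerate an
    # inexact match, roughly mirroring how classification thresholds let
    # misspellings like AUSTRLIA still classify as AUSTRALIA.
    if len(classified) < 5:
        return token == classified
    return SequenceMatcher(None, token, classified).ratio() >= threshold
```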


There will be a lot of different formats, especially when it's all in one field.
Look at the way AUADDR does it: it spends a lot of time up front chipping away at things to reduce the number of patterns before it gets to the main street part.

Multiple_Semantics - very important for terms like ST, UPPER, FLAT, etc. that mean different things in different places. That list needs some work.

Concatenated_Tokens (?) splits out the > and < based on known alphabetical parts.

Unit, Level, Postal_Address (?) all do the same thing: take parts of the address that by themselves have limited sets of patterns, but multiply the number of overall address patterns. They take out the relevant bits and leave the remainder of the address a bit simpler to look at.

It's quite weak by default when it comes to things like shopping centres and malls/arcades, as well as property and building names, corner-of addresses, etc.
Postal addresses, especially Care PO, could also use some work.

Like Ray said, you can try things like chaining one into the other: AUADDR -> AUAREA or I guess even AUAREA -> AUADDR and see what you get.

You might also be able to fit the Aus Post locality file in there, to better pick the end of the street and the start of the locality, especially when the locality name starts with a T or an X.
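That locality-file lookup can be sketched as a longest-trailing-match scan. The locality set below is a tiny invented sample, not the Aus Post file, and the function is only an illustration of the idea.

```python
# Use a reference locality list to decide where the street ends and the
# locality begins, by scanning for a trailing run of tokens that forms a
# known locality. This handles localities that start with a street-type
# word, e.g. ST KILDA, which a naive first-T split gets wrong.
LOCALITIES = {"ST KILDA", "WOODSIDE", "ST MARKS"}  # illustrative sample only

def split_on_locality(tokens: list[str]) -> tuple[str, str]:
    for i in range(len(tokens)):
        candidate = " ".join(tokens[i:])
        if candidate in LOCALITIES:
            return " ".join(tokens[:i]), candidate
    return " ".join(tokens), ""
```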


As for debugging, there is nothing at the moment. What I've done in the past is create an extra field to hold trace info, then label every subroutine and pattern in the file with a short code. Then, as each subroutine or pattern is hit, the first thing it does is append its code to the trace field.
This will show you which subroutines were entered and which patterns were hit (as opposed to the ones you expected to be hit).
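The trace-field workaround looks something like this in Python. Since PAL itself has no debugger, this is a generic illustration of the technique; the codes (MAIN, HN1, HN0) and the toy house-number check are invented.

```python
def parse_with_trace(tokens: list[str]) -> tuple[dict, str]:
    trace: list[str] = []

    def hit(code: str) -> None:
        # Equivalent of appending a short code to the trace field as the
        # first action of each labelled subroutine/pattern in the PAL file.
        trace.append(code)

    result: dict = {}
    hit("MAIN")                          # entered the main routine
    if tokens and tokens[0].isdigit():
        hit("HN1")                       # house-number pattern matched
        result["HouseNumber"] = tokens[0]
    else:
        hit("HN0")                       # house-number pattern missed
    return result, ";".join(trace)
```

Reading the trace field off each output record shows exactly which branches fired for that record, which is what makes the unexpected paths visible.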

It's a lot of effort to go through it all (even worse for AUNAME!), but the effort pays for itself.
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3590
Location: Australia, Melbourne
Points: 28109

Posted: Thu Oct 17, 2013 6:11 pm

There is address standardisation, which QualityStage can do out of the box, and then there is address verification, where a postal address file is used to verify that an address is real and to return a delivery point identifier (DPID). With QualityStage, the DPID capability is an add-on module with a separate licence fee, OEMed from Hopewiser. This is where spelling mistakes, area mistakes and postcode mistakes can be fixed up.

Australia Post has very specific rules about how much alteration you can make to an address when verifying it - they do not want tonnes of mail going to the wrong addresses just because of automated software glitches - so the software is certified by Australia Post as compliant. Small spelling mistakes and certain postcode/area mistakes can be repaired by verification software.

_________________
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney LinkedIn