DSXchange: DataStage and IBM Websphere Data Integration Forum
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Wed Oct 16, 2013 2:19 pm

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
Hi,

I've got some really dirty Australian address data, with lots of "added" information. The entire address is in a single VarChar field with no consistent format.

Can you suggest ideas for cleansing the data (using DataStage, with which I am quite familiar) or QualityStage? I am permitted to create a copy of the rule sets if necessary, and even to modify the pattern-action script.

Thanks,
Kevin

_________________
Kevin K Tashadow
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54431
Location: Sydney, Australia
Points: 295208

Posted: Wed Oct 16, 2013 3:18 pm

Welcome aboard.

What have you tried? How successful was it?

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne (especially seeking good business analysts)
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Wed Oct 16, 2013 6:32 pm

I have a DataStage job that splits the string at the first street type token in the data; the token and everything preceding it go into the addr bucket, while everything following the token goes into the area bucket. If no street type token exists, all of the data goes into the area bucket.
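For illustration, that split can be sketched in Python. This is only a sketch of the logic, not the actual DataStage job; the street-type token list here is a small invented sample, not the real QualityStage classification table.

```python
# Split an address at the first street-type token: the token and everything
# before it go into "addr"; the rest goes into "area". If no street-type
# token is found, the whole string goes into "area".
STREET_TYPES = {"ST", "RD", "AVE", "DR", "DRIVE", "STREET", "ROAD",
                "AVENUE", "CRES", "PDE"}  # illustrative sample only

def split_address(raw: str) -> tuple[str, str]:
    tokens = raw.upper().split()
    for i, tok in enumerate(tokens):
        if tok in STREET_TYPES:
            return " ".join(tokens[: i + 1]), " ".join(tokens[i + 1:])
    return "", " ".join(tokens)
```

Note that this naive first-match split misparses localities whose names contain a street-type word, such as ST KILDA RD, which is exactly the weakness discussed later in this thread.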

Using character investigation (as taught in the QualityStage training) I have identified over 100 input patterns for the AUAREA rule set that lead to unhandled data. In only a small proportion of these does the unhandled pattern match the input pattern. I am considering incorporating all of these into a new AREA rule set, but that may make it cumbersomely large, with an associated performance penalty.

That's why I'm looking for ideas.

_________________
Kevin K Tashadow
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3590
Location: Australia, Melbourne
Points: 28109

Posted: Wed Oct 16, 2013 10:23 pm

Well, this is what QualityStage is for: breaking string fields into tokens and standardising them. Try the QualityStage Investigate and Standardize stages. You could spend a year writing DataStage code to parse all the combinations; you need QualityStage.

_________________
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney LinkedIn
kevink
Participant



Joined: 16 Oct 2013
Posts: 7
Location: Sydney
Points: 83

Posted: Thu Oct 17, 2013 1:49 am

I am using QualityStage but, because there is so much extraneous information and so many different ad hoc formats, the AUAREA and AUADDR rule sets are proving quite poor at parsing out significant information.

There are some 377 different input patterns using the AUAREA rule set. Over 30% of records have a non-null unhandled pattern when investigated. About half of those have an unhandled pattern identical to the input pattern.

People doing data entry are no doubt trying to be helpful but, for example, an address like 51 ST MARKS DRIVE ST MARKS WOODSIDE (no postcode or state) is a nightmare. The LocalityName should be ST MARKS, while WOODSIDE is additional information (ExceptionData?).

There are lots of misspellings, too (for example, eleven different spellings of AUSTRALIA, not counting the ones in state names).

I guess what I'm seeking to learn is efficient ways to create pattern-action scripts, rather than trying to handle every possible input pattern.

Supplementary question - is there any way to debug PAL, for example to dump values at run time?

_________________
Kevin K Tashadow
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54431
Location: Sydney, Australia
Points: 295208

Posted: Thu Oct 17, 2013 1:52 am

Have you tried running four Standardization stages, passing the unhandled data from each to the next, perhaps adding in the postcode and state ahead of the second AUAREA standardization? Code: ...
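The cascade idea can be sketched like this. The stages below are hypothetical stand-ins for chained Standardize stages; this only illustrates the control flow of passing unhandled records from one pass to the next, not the actual (elided) job design.

```python
from typing import Callable, Optional

# Each "stage" returns a parsed result, or None if the record is unhandled.
Stage = Callable[[str], Optional[dict]]

def cascade(record: str, stages: list[Stage]) -> Optional[dict]:
    # Pass the record through each standardization stage in turn; the first
    # stage that handles it wins. Unhandled residue flows on to the next
    # stage, mirroring the chained-Standardize approach described above.
    for stage in stages:
        result = stage(record)
        if result is not None:
            return result
    return None  # still unhandled after all passes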

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne (especially seeking good business analysts)
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 518
Location: Melbourne
Points: 3853

Posted: Thu Oct 17, 2013 4:35 am

Splitting at the first T is a problem, because you'll get it wrong every time you hit something like St Kilda Rd or High St Rd.

Obviously the word investigation will be a big thing: the more you can classify, the easier your life will be in some ways. It will also show you the longer words where you can use the optional inexact comparison thresholds.
Also look at the SEPLIST and STRIPLIST. You might see some unusual punctuation in use, and letting it through so you can handle it specifically may help you out as well.
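The inexact-comparison idea can be illustrated with a simple similarity check. This is a generic sketch, not QualityStage's actual comparison routine; the length cutoff and threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def close_enough(token: str, classified: str, threshold: float = 0.85) -> bool:
    # Short tokens must match exactly; longer classified words tolerate an
    # inexact match, roughly mirroring how classification thresholds let
    # misspellings like AUSTRLIA still classify as AUSTRALIA.
    if len(classified) < 5:
        return token == classified
    return SequenceMatcher(None, token, classified).ratio() >= threshold
```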


There will be a lot of different formats, especially when it's all in one field.
Look at the way AUADDR does it: it spends a lot of time up front chipping away at things to reduce the number of patterns before it gets to the main street part.

Multiple_Semantics - very important for terms like ST, UPPER, FLAT, etc. that mean different things in different places. That list needs some work.

Concatenated_Tokens (?) splits out the > and < based on known alphabetical parts.

Unit, Level, Postal_Address (?) all do the same thing: take parts of the address that by themselves have limited sets of patterns, but multiply the number of overall address patterns. They take out the relevant bits and leave the remainder of the address a bit simpler to look at.

It's quite weak by default when it comes to things like shopping centres and malls/arcades, as well as property and building names, corner-of addresses, etc.
Postal addresses, especially Care PO, could also use some work.

Like Ray said, you can try things like chaining one into the other: AUADDR -> AUAREA or I guess even AUAREA -> AUADDR and see what you get.

You might also be able to fit the Aus Post locality file in there, to better pick the end of the street and the start of the locality, especially when the locality name starts with a T or an X.
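That locality-file lookup can be sketched as a longest-trailing-match scan. The locality set below is a tiny invented sample, not the Aus Post file, and the function is only an illustration of the idea.

```python
# Use a reference locality list to decide where the street ends and the
# locality begins, by scanning for a trailing run of tokens that forms a
# known locality. This handles localities that start with a street-type
# word, e.g. ST KILDA, which a naive first-T split gets wrong.
LOCALITIES = {"ST KILDA", "WOODSIDE", "ST MARKS"}  # illustrative sample only

def split_on_locality(tokens: list[str]) -> tuple[str, str]:
    for i in range(len(tokens)):
        candidate = " ".join(tokens[i:])
        if candidate in LOCALITIES:
            return " ".join(tokens[:i]), candidate
    return " ".join(tokens), ""
```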


As for debugging, there is nothing at the moment. What I've done in the past is create an extra field to hold trace info, then label every subroutine and pattern in the file with a short code. Then, as each subroutine or pattern is hit, the first thing it does is append its code to the trace field.
This will show you which subroutines were entered and which patterns were hit (as opposed to the ones you expected to be hit).
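The trace-field workaround looks something like this in Python. Since PAL itself has no debugger, this is a generic illustration of the technique; the codes (MAIN, HN1, HN0) and the toy house-number check are invented.

```python
def parse_with_trace(tokens: list[str]) -> tuple[dict, str]:
    trace: list[str] = []

    def hit(code: str) -> None:
        # Equivalent of appending a short code to the trace field as the
        # first action of each labelled subroutine/pattern in the PAL file.
        trace.append(code)

    result: dict = {}
    hit("MAIN")                          # entered the main routine
    if tokens and tokens[0].isdigit():
        hit("HN1")                       # house-number pattern matched
        result["HouseNumber"] = tokens[0]
    else:
        hit("HN0")                       # house-number pattern missed
    return result, ";".join(trace)
```

Reading the trace field off each output record shows exactly which branches fired for that record, which is what makes the unexpected paths visible.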

It's a lot of effort to go through it all (even worse for AUNAME!), but the effort pays for itself.
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3590
Location: Australia, Melbourne
Points: 28109

Posted: Thu Oct 17, 2013 6:11 pm

There is address standardisation, which QualityStage can do out of the box, and then there is address verification, where a postal address file is used to verify that an address is real and to return a delivery point identifier (DPID). With QualityStage, the DPID capability is an add-on module with a separate licence fee, OEMed from Hopewiser. This is where spelling mistakes, area mistakes and postcode mistakes can be fixed up.

Australia Post has very specific rules about how much alteration you can make to an address when verifying it - they do not want tonnes of mail going to the wrong addresses just because of automated software glitches - so the software is certified by Australia Post as compliant. Small spelling mistakes and certain postcode/area mistakes can be repaired by verification software.

_________________
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney LinkedIn