USNAME clarification

Infosphere's Quality Product

Moderators: chulett, rschirm

dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Hi,

I am trying to understand the USNAME rule set and hope someone can shed more light on the points below. I am new to QualityStage and am implementing it with the help of the documentation guides.

Input:
FirstName: ROBERT & ANGIE
MidInit: (null)
LastName: STEGEMAN

Existing system output:
Std_FirstName: ROBERT
std_midint: A
std_lastname: STEGEMAN

After USNAME ruleset:
FirstName_USNAME: ROBERT
MiddleName_USNAME: (null)
PrimaryName_USNAME: STEGEMAN
AdditionalName_USNAME: ANGIE STEGEMAN

In the above example, MidInit is null in the input, but the existing system output takes the 'A' from "ANGIE" as the middle initial. With the QualityStage USNAME standardization rule set, MiddleName is null.

1) I am not sure whether the existing system output is correct or whether QualityStage produces the better standardized result. (Or can only the business confirm what they need?)

2) The MNS stage and the Address Verification stage both parse the address field and output house number, street type and box number. Would either stage suffice, or is there a difference between them?

3) Are there any rulesets which can derive census fields?

Thanks in advance.

Thanks
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi DJ,

1)
For ROBERT & ANGIE STEGEMAN, you should see:
FirstName: ROBERT
PrimaryName: STEGEMAN
AdditionalName: ANGIE STEGEMAN

For ROBERT STEGEMAN ANGIE STEGEMAN you should see:
UnhandledData: ROBERT STEGEMAN ANGIE STEGEMAN

The above is 'out-of-the-box', so I suspect your rule sets have been overridden or otherwise modified. What is the UserOverrideFlag value for each test case?

2) AVI has reference data behind it, so it can verify that the address exists and potentially correct it. MNS is 'just' standardization. Also, AVI handles 240+ countries, while MNS covers fewer.

3) The latest version of QualityStage US Address Certification (USAC) adds MSA and FIPS codes to the output.

I hope this helps!
Regards,
Robert
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Thanks for the reply

For ROBERT & ANGIE STEGEMAN, you should see:
FirstName: ROBERT
PrimaryName: STEGEMAN
AdditionalName: ANGIE STEGEMAN

There is no override flag, and the output is as expected.

I can see a lot of unhandled data that the rule set is not able to standardize, and there are many different patterns.

There are also a few records with data quality issues, such as M0rgan,STEP4HAN.

How does one decide which patterns to override? Does this not defeat the purpose of standardization?

If I customize the rule set, what happens in a production run when a new unhandled pattern arises?

Pardon my ignorance, as I am just getting started.

Thanks,Dj
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi DJ,

As you noted, the challenge is that when you look at an unhandled pattern (like @,@ for M0rgan,STEP4HAN), you have no idea whether that pattern occurs once or many times.

I want to spend my time fixing the things that cause the most problems. So the method I use is to create a process after standardization that helps me understand the most frequent unhandled patterns: aggregate with a count on the unhandled pattern, then join back to the input.

As for production, sure, new data always comes in :) One technique is to use the job you just created above and run that in production. Then you can set thresholds and evaluate when you need to look at adding new patterns.
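Outside of a DataStage job, the aggregate-and-join-back idea can be sketched in a few lines of Python. This is only an illustration, assuming the standardized output has been exported as (input text, unhandled pattern) pairs; the function names and the 1% default threshold are hypothetical and not part of QualityStage.

```python
from collections import Counter, defaultdict


def unhandled_pattern_report(records, max_examples=3):
    """Count unhandled patterns and keep a few input examples for each.

    `records` is an iterable of (input_text, unhandled_pattern) pairs.
    An empty pattern means the record was fully handled and is skipped.
    Returns (pattern, count, examples) tuples, most frequent first.
    """
    counts = Counter()
    examples = defaultdict(list)
    for text, pattern in records:
        if not pattern:
            continue
        counts[pattern] += 1
        # "Join back to the input": keep sample inputs next to the pattern.
        if len(examples[pattern]) < max_examples:
            examples[pattern].append(text)
    return [(p, n, examples[p]) for p, n in counts.most_common()]


def patterns_over_threshold(report, total_records, threshold=0.01):
    """Return patterns whose share of all records reaches the threshold."""
    if not total_records:
        return []
    return [p for p, n, _ in report if n / total_records >= threshold]


records = [
    ("M0rgan,STEP4HAN", "@,@"),
    ("ROBERT & ANGIE STEGEMAN", ""),  # handled, so no unhandled pattern
    ("J0HN,SM1TH", "@,@"),
    ("REJEAN A FORGUES", "+I+"),
]
report = unhandled_pattern_report(records)
flagged = patterns_over_threshold(report, len(records), threshold=0.5)
```

Running the same report on each production load and comparing the flagged patterns against the threshold gives a simple signal for when new patterns deserve attention.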

I hope this helps!
Robert
Regards,
Robert
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Typically you would run a character investigation over the unhandled patterns, specifying a few (I usually use 5) for the sample size. The frequency distribution will immediately suggest where your effort is likely to yield the most benefit.
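For anyone who wants to see the idea outside the Investigate stage, here is a rough Python sketch of a character-type frequency distribution with a sample size of 5. The masking convention (a for letters, n for digits, other characters kept as-is) and the function names are assumptions made for illustration and may not match exactly what the stage reports.

```python
from collections import Counter, defaultdict


def char_pattern(value):
    """Mask a value by character type: letters -> 'a', digits -> 'n'."""
    masked = []
    for ch in value:
        if ch.isalpha():
            masked.append("a")
        elif ch.isdigit():
            masked.append("n")
        else:
            masked.append(ch)  # punctuation and spaces are kept unchanged
    return "".join(masked)


def character_investigation(values, sample_size=5):
    """Return (pattern, count, samples) tuples, most frequent first."""
    counts = Counter()
    samples = defaultdict(list)
    for value in values:
        pattern = char_pattern(value)
        counts[pattern] += 1
        if len(samples[pattern]) < sample_size:
            samples[pattern].append(value)
    return [(p, n, samples[p]) for p, n in counts.most_common()]


result = character_investigation(["M0rgan", "J0rdan", "STEP4HAN"])
```

A digit sitting in the middle of an otherwise alphabetic mask, as in M0rgan, shows up as its own pattern with its own count, which is what makes the distribution useful for prioritising.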
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

+1 for Ray and Robert's approaches.

- Token analysis to understand the + tokens and whether classifying them would add value. This can often direct a lot of records to existing patterns for very little effort.

- Further character analysis to see what's up with the @ tokens. You might need to split some, or add a rule to handle a known character pattern, to again pick up a lot of records easily.

- Group by unhandled pattern and then get examples, or also group by initial pattern. You may have a pattern that is very close to one that is handled, which represents a valid scenario the ruleset doesn't handle.

- Grouping by initial word pattern can also show you where a partial pattern rule will help out big-time. In the same way that subroutines are used to pluck known sub-patterns out of the data because they have a limited number of specific meanings (the ADDR rulesets use this technique a lot), you can do the same to greatly reduce the variation in your patterns before they go through the main list of patterns.
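As a rough illustration of the last point, grouping unhandled patterns by their leading symbols can be sketched in Python as below. It assumes each pattern is a string with one character per token class; the function name and the prefix length of 2 are hypothetical choices for the example.

```python
from collections import Counter


def leading_pattern_counts(patterns, prefix_len=2):
    """Count unhandled patterns by their first `prefix_len` symbols.

    Empty patterns (fully handled records) are ignored.
    Returns (prefix, count) tuples, most frequent first.
    """
    prefixes = Counter(p[:prefix_len] for p in patterns if p)
    return prefixes.most_common()


groups = leading_pattern_counts(["+I+", "+I+F", "@,@", "+I", ""])
```

A prefix that accounts for a large share of the unhandled records is a candidate for a partial pattern rule placed ahead of the main pattern list.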
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Working on the unhandled records. Thanks all for your responses.
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Can someone clarify the points below?

1) For Brazil, I'm unable to find a name ruleset similar to those for the US or Canada. Are there specific rulesets that have to be installed or downloaded separately?

2) Can business and person records not be handled in the same standardization stage? Currently the records are filtered, handled separately and then joined back:
a) Personal records: Filter -> standardization stage USNAME (Firstname, Mid, Lastname), all records as Individuals -> output
b) Business records: USNAME (Businessname), all records as Organisation

Thanks,Dj
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Hi, I'm not sure if there's a Brazilian set of rulesets.
You may be able to create your own. I know they're not the same, but the ES rulesets might be close enough to help somewhat.


You can split the names by Individual and Organisation if you already have them categorised. It would tend to be as reliable as your categorisation.

Otherwise you could use "default to" either Individual or Organisation. To make this more reliable, you'll need to add relevant terms to the O and W categories to get it to pick up more Org patterns. I would expect Default to Individual with a properly filled out list of O and W terms would work quite well.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Brazil is included, just not pre-loaded into the repository (http://www-01.ibm.com/support/knowledge ... rules.html).

On the client, go to C:\IBM\InformationServer\Clients\Classic\QSRules. You will see the Brazil rule set (read the 'readme.txt' for instructions on how to load it).
Regards,
Robert
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

Thanks Robert, we were able to import the Brazil ruleset.

Why do the name rulesets overlap business and individual records? We can see that some 'I' records get a name type of O after standardization, and vice versa.

E.g. the business record "lincon heights ltd" is incorrectly treated as I, and first, middle and last name are populated.

Can someone advise how to handle these overlapping records?
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi,

Regarding the Brazil rule set:
You should be able to import the Brazil rule set. What was the error?

Regarding 'lincon heights ltd':
What ruleset did you use?
Also, 'LTD' is not a Brazilian company suffix, so you may need to add an override to make LTD a class M (company suffix in the Brazilian rule set).
Regards,
Robert
dj
Participant
Posts: 78
Joined: Thu Aug 24, 2006 5:03 am
Location: india

Post by dj »

1) Brazil ruleset -
"Thanks Robert, we were able to import the brazil ruleset." There were no issues :-)

2) LINCON HEIGHTS LTD - CNAME ruleset - Although it is a business record, the ruleset parses LINCON as "F", sets the name type to "I" and treats it as a person record.

The "I" records below from the input are treated as "O" in CNAME:
FERNAND GERMAIN
RICAHRD PARIS
JULIEN DESILETS

How do we handle these overlapping records?

3) For unhandled patterns such as +I+ (REJEAN A FORGUES), even when I add them as override patterns, how do we derive the other columns, such as Name Type and Gender?

Thanks
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Hi,

Can you please clarify the rule set name? CANAME or perhaps CNNAME?

For the others:
- Add LTD as an organization word.
- What are the patterns for the I names?
- For Rejean, it depends on the rule set.
Regards,
Robert