USNAME clarification

Posted: Thu Nov 12, 2015 11:21 am
by dj
Hi,

I am trying to understand the USNAME ruleset and hope someone can shed more light on the questions below, as I am new to QualityStage and am trying to implement it with the help of the documentation guides.

Input:
FirstName: ROBERT & ANGIE
MidInit: (null)
LastName: STEGEMAN

Existing system output:
Std_FirstName: ROBERT
std_midint: A
std_lastname: STEGEMAN

After USNAME ruleset:
FirstName_USNAME: ROBERT
MiddleName_USNAME: (null)
PrimaryName_USNAME: STEGEMAN
AdditionalName_USNAME: ANGIE STEGEMAN

In the above example, MidInit is null in the input, but the existing system output takes the 'A' from "ANGIE" as the middle initial. With the QualityStage USNAME standardization ruleset, MiddleName is null.

1) I am not sure whether the existing system output is correct or whether QualityStage produces the better standardized result. (Or can only the business confirm what they need?)

2) The MNS stage and the Address Verification stage both parse the address field and output house number/street type/box number. Would either stage suffice, or is there a difference?

3) Are there any rulesets which can derive census fields?

Thanks in advance.

Thanks

Posted: Fri Nov 13, 2015 3:40 pm
by rjdickson
Hi DJ,

1)
For ROBERT & ANGIE STEGEMAN, you should see:
FirstName: ROBERT
PrimaryName: STEGEMAN
AdditionalName: ANGIE STEGEMAN

For ROBERT STEGEMAN ANGIE STEGEMAN you should see:
UnhandledData: ROBERT STEGEMAN ANGIE STEGEMAN

The above is 'out-of-the-box', so I suspect your rule sets have been overridden or otherwise modified. What is the UserOverrideFlag value for each test case?

2) AVI has reference data behind it, so it can verify that the address exists, and potentially correct it. MNS is 'just' standardization. Also, AVI handles 240+ countries, while MNS covers fewer.

3) The latest version of QualityStage US Address Certification (USAC) adds MSA and FIPS codes to the output.

I hope this helps!

Posted: Sun Nov 22, 2015 12:51 pm
by dj
Thanks for the reply

For ROBERT & ANGIE STEGEMAN, you should see:
FirstName: ROBERT
PrimaryName: STEGEMAN
AdditionalName: ANGIE STEGEMAN

There is no override flag, and the output is as expected.

I can see there is a lot of unhandled data that the ruleset is not able to standardize, and there are many patterns.

There are also a few records with data quality issues, like M0rgan,STEP4HAN.

How does one decide which patterns to override? Does this not defeat the purpose of the standardization?

If I customize the ruleset, what happens in a production run when a new unhandled pattern arises?

Pardon my ignorance, as I am just getting started.

Thanks,Dj

Posted: Sun Nov 22, 2015 2:13 pm
by rjdickson
Hi DJ,

As you noted, the challenge is that when you look at an unhandled pattern (like @,@ for M0rgan,STEP4HAN), you have no idea if that pattern occurs once or many times.

I want to spend my time fixing the things that are causing the most problems. So the method I use is to create a process after standardization that helps me understand the most frequent unhandled patterns: aggregate with a count on the unhandled pattern, and join back to the input.
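For readers who want to see the shape of that aggregate-and-join step outside a DataStage job, here is a minimal Python sketch. It is only an illustration: the sample rows, the `pattern_report` function, and the convention that an empty pattern means "handled" are all assumptions, not QualityStage output.

```python
from collections import Counter

# Hypothetical standardized output rows: (input_name, unhandled_pattern).
# Assumption: an empty pattern means the rule set handled the record.
rows = [
    ("M0rgan,STEP4HAN", "@,@"),
    ("ROBERT STEGEMAN", ""),
    ("Sm1th,J0HN", "@,@"),
    ("REJEAN A FORGUES", "+I+"),
]

def pattern_report(rows):
    """Count each unhandled pattern, then attach the count to every unhandled row."""
    counts = Counter(pattern for _, pattern in rows if pattern)
    joined = [(name, pattern, counts[pattern]) for name, pattern in rows if pattern]
    # Most frequent patterns first; ties keep input order because sort is stable.
    joined.sort(key=lambda row: -row[2])
    return counts, joined

counts, joined = pattern_report(rows)
for name, pattern, n in joined:
    print(f"{n:>3}  {pattern:<5}  {name}")
```

Sorting the joined rows by count puts example records for the most common unhandled patterns at the top, which is where the effort is best spent.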

As for production, sure, new data always comes in :) One technique is to use the job you just created above and run that in production. Then you can set thresholds and evaluate when you need to look at adding new patterns.
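The threshold idea could look something like the sketch below. The function name `needs_review`, both threshold values, and the sample counts are made up for illustration; suitable limits depend on your data volumes.

```python
def needs_review(total_records, unhandled_records, pattern_counts,
                 max_unhandled_rate=0.05, min_pattern_count=100):
    """Return (rate_exceeded, frequent_patterns) for one production run.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    rate = unhandled_records / total_records if total_records else 0.0
    # Patterns seen often enough to be worth adding a rule or override for.
    frequent = sorted(p for p, n in pattern_counts.items() if n >= min_pattern_count)
    return rate > max_unhandled_rate, frequent

exceeded, frequent = needs_review(
    total_records=10_000,
    unhandled_records=800,
    pattern_counts={"@,@": 450, "+I+": 120, "F@+": 3},
)
print(exceeded, frequent)
```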

I hope this helps!
Robert

Posted: Sun Nov 22, 2015 5:51 pm
by ray.wurlod
Typically you would run a character investigation over the unhandled patterns, specifying a few samples (I usually use 5) as the sample size. The frequency distribution will immediately suggest where your effort is likely to yield the most benefit.
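As a rough illustration of what a character investigation with a small sample size produces, the Python sketch below masks each value by character type and keeps up to five samples per mask. The mapping (letters to 'a', digits to 'n', everything else kept) is an assumption chosen for the example and may not match the Investigate stage's output symbols exactly.

```python
from collections import defaultdict

def char_pattern(value):
    """Map letters to 'a' and digits to 'n'; keep every other character as-is."""
    return "".join(
        "a" if ch.isalpha() else "n" if ch.isdigit() else ch
        for ch in value
    )

def investigate(values, sample_size=5):
    """Return [(pattern, count, samples)] sorted by descending count."""
    counts = defaultdict(int)
    samples = defaultdict(list)
    for value in values:
        pattern = char_pattern(value)
        counts[pattern] += 1
        if len(samples[pattern]) < sample_size:
            samples[pattern].append(value)
    return sorted(
        ((p, counts[p], samples[p]) for p in counts),
        key=lambda item: -item[1],
    )

report = investigate(["M0rgan", "STEP4HAN", "J0rdan", "Morgan"])
for pattern, count, examples in report:
    print(count, pattern, examples)
```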

Posted: Mon Nov 23, 2015 5:12 pm
by stuartjvnorton
+1 for Ray and Robert's approaches.

- Token analysis to understand the + tokens and whether classifying them would add value. This can often direct a lot of records to existing patterns for very little effort.

- Further character analysis to see what's up with the @ tokens. You might need to split some, or add a rule to handle a known character pattern, to again pick up a lot of records easily.

- Group by unhandled pattern then get examples, or also group by initial pattern. You may have a pattern that is very close to one that is handled, which represents a valid scenario the ruleset doesn't handle.
- Grouping by initial word pattern can also show you where a partial pattern rule will help out big-time. In the same way that subroutines are used to pluck known sub-patterns out of the data because they have a limited number of specific meanings (the ADDR rulesets use this technique a lot), you can do the same to greatly reduce the variation in your patterns before they go through the main list of patterns.
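The "group by initial pattern" idea can be sketched in a few lines of Python. This assumes each character of a pattern string stands for one token class, and the sample patterns and the `prefix_counts` helper are hypothetical.

```python
from collections import Counter

def prefix_counts(patterns, prefix_len=2):
    """Count how often each leading pattern fragment occurs, most common first."""
    return Counter(p[:prefix_len] for p in patterns).most_common()

# Hypothetical unhandled patterns taken from a standardization run.
patterns = ["+I+", "+I+F", "+IF+", "@,@", "+I"]
print(prefix_counts(patterns))
```

A prefix that dominates the list is a candidate for a partial pattern rule that handles the leading tokens before the main pattern list runs.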

Posted: Thu Nov 26, 2015 7:28 am
by dj
Working on the unhandled records. Thanks, all, for your responses.

Posted: Tue Dec 15, 2015 3:02 am
by dj
Can someone clarify the below?

1) For Brazil, I'm unable to find a name ruleset similar to those for the US or Canada. Are there any specific rulesets that have to be installed/downloaded separately?

2) Can Business and Person records not be handled in the same standardization stage? Currently the records are filtered, handled separately, and then joined back:
a) Personal records: Filter -> (Personal records) -> standardization stage USNAME (Firstname, Mid, Lastname), all records as Individuals -> output
b) Business records: USNAME (Businessname), all records as Organisation

Thanks,Dj

Posted: Tue Dec 15, 2015 5:04 pm
by stuartjvnorton
Hi, not sure if there's a Brazilian set of rulesets.
You may be able to create your own. I know they're not the same, but the ES rulesets might be close enough to help somewhat.


You can split the names by Individual and Organisation if you already have them categorised. It would tend to be as reliable as your categorisation.

Otherwise you could use "default to" either Individual or Organisation. To make this more reliable, you'll need to add relevant terms to the O and W categories to get it to pick up more Org patterns. I would expect Default to Individual with a properly filled out list of O and W terms would work quite well.
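To make the "default to Individual plus a good list of Org terms" idea concrete, here is a toy Python sketch. It does not reproduce how the name rulesets classify tokens; the `ORG_TERMS` set merely stands in for the O and W terms mentioned above, and its contents and the example names are hypothetical.

```python
# Hypothetical organisation terms standing in for the O and W class entries.
ORG_TERMS = {"LTD", "LIMITED", "INC", "CORP"}

def name_type(name, org_terms=ORG_TERMS, default="I"):
    """Return 'O' if any token is a known organisation term, else the default type."""
    tokens = name.upper().replace(",", " ").replace(".", " ").split()
    return "O" if any(token in org_terms for token in tokens) else default

print(name_type("ACME SUPPLY LTD"))   # hypothetical business name
print(name_type("JANE EXAMPLE"))      # hypothetical person name
```

The fuller the term list, the fewer organisations fall through to the Individual default, which is the point made above.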

Posted: Thu Dec 17, 2015 6:29 am
by rjdickson
Brazil is included, just not pre-loaded into the repository (http://www-01.ibm.com/support/knowledge ... rules.html).

On the client, go to C:\IBM\InformationServer\Clients\Classic\QSRules. You will see the Brazil rule set (read the 'readme.txt' for instructions on how to load it).

Posted: Thu Jan 28, 2016 2:10 pm
by dj
Thanks Robert, we were able to import the Brazil ruleset.

Why do the name rulesets overlap business and individual records? We were able to see some 'I' records get mapped to name type O after standardization, and vice versa.

E.g., the business record 'lincon heights ltd' is incorrectly treated as I, and the first, middle, and last name are populated.

Can someone advise how to handle these overlapping records?

Posted: Sun Jan 31, 2016 8:46 pm
by rjdickson
Hi,

Regarding the Brazil rule set:
You should be able to import the Brazil rule set. What was the error?

Regarding 'lincon heights ltd':
What ruleset did you use?
Also, 'LTD' is not a Brazilian company suffix, so you may need to add an override to make LTD a class M (company suffix in the Brazilian rule set).

Posted: Tue Feb 02, 2016 3:06 am
by dj
1) Brazil ruleset -
"Thanks Robert, we were able to import the brazil ruleset." There were no issues :-)

2) LINCON HEIGHTS LTD - CNAME ruleset - Though it is a business record, it parses LINCON as "F" and the name type as "I", and treats it as a Person record.

The "I" records below from the input are treated as "O" in CNAME.
FERNAND GERMAIN
RICAHRD PARIS
JULIEN DESILETS

How do we handle these overlapping records?

3) For unhandled patterns such as +I+ (REJEAN A FORGUES), even if I add them as override patterns, how do we derive the other columns, such as Name Type and Gender?

Thanks

Posted: Tue Feb 02, 2016 5:30 am
by rjdickson
Hi,

Can you please clarify the rule set name? CANAME or perhaps CNNAME?

For the others:
- Add LTD as an organization word
- what are the patterns for the I names?
- for Rejean, it depends on the rule set