DSXchange: DataStage and IBM Websphere Data Integration Forum
dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Tue Jul 26, 2016 5:03 am

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
Hi All,

I have set up the tutorial jobs for the Match specification and am trying to understand the rules. I have a few clarifications.

1) What is the significance of param1? For first_name it is 850 and for last_name 800, and a few columns have no param specified. On what basis should it be set?

2) The lowest weight for the set of sample data is 14 and the highest is 50. The cutoff value for both match and clerical is set to 12. After verifying that the lowest-weight pair is a valid match, can we set the cutoff to a value just below that lowest weight?

Also, how does a cutoff value set for one specific file work for data from other source systems?

3) On a web-service call, how do we set up the frequency file? For every array in the call, do we need to create the corresponding Match Frequency file for the data before the match specification stage?

Though it depends on the data and requires iterative analysis, any inputs on the above to get started would be of great help :-)

Thanks.
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 517
Location: Melbourne
Points: 3850

Posted: Tue Jul 26, 2016 6:34 pm

Ok, it's been a little while, but I'll have a crack:

1 - param1 is a parameter for the type of comparison you are using, and different comparisons require a different number of parameters (either 0, 1 or 2).
In the case of the 850 and 800, they would be for one of the uncert functions, and tell the function how strict it needs to be when determining whether the names match or not. 850 is the tightest, and 800 is not as tight but still quite strict.
Look to the doco for what each comparison method requires.
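To make the thresholding idea concrete, here is a minimal sketch. It is not the real uncert comparator (QualityStage uses its own string comparator internally); `difflib.SequenceMatcher` merely stands in for "a similarity score scaled to 0-900", so you can see why 850 is stricter than 800:

```python
from difflib import SequenceMatcher

def uncert_like(a: str, b: str, param1: int) -> bool:
    """Stand-in for an uncert-style comparison: treat two strings as
    agreeing when their similarity, scaled to 0-900, meets param1.
    (SequenceMatcher is only an illustration of the thresholding;
    the real comparator is different.)"""
    score = int(SequenceMatcher(None, a.upper(), b.upper()).ratio() * 900)
    return score >= param1

# A small spelling variation can pass the looser 800 threshold
# yet fail the tighter 850 one:
print(uncert_like("JOHNSON", "JOHNSTON", 800))  # True
print(uncert_like("JOHNSON", "JOHNSTON", 850))  # False
```

The same pair of strings agreeing at 800 but not at 850 is exactly the "strictness" difference between the two param1 values in the tutorial spec.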

2 - The cutoff values are specific to each match pass. If you are matching on 3 fields then 20 might be quite a strong match, but for another match pass that compares 7 fields, 20 might not even be a match.

As for setting the cutoff, you review the matches and clericals and poke around the non-matches to understand what looks like a reliable match and what doesn't.
Set the match cutoff with the idea that anything above it is a match that you're confident with (this will change depending on what you're using the matching for).
If you don't get consistent bands of scores that line up with what you consider a match, tweak the individual field comparison scores to shape the overall score and create clearer bands.
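The composite weights being banded like this comes from the Fellegi-Sunter model that probabilistic matchers of this kind are based on: each field contributes an agreement weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)), and the pair's score is the sum. A small sketch with illustrative m/u values (not taken from any real specification):

```python
import math

def field_weights(m: float, u: float):
    """Fellegi-Sunter weights for one field:
    m = P(field agrees | records are a true match)
    u = P(field agrees | records are a non-match)."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Illustrative probabilities only:
fields = {"first_name": (0.95, 0.02),
          "last_name":  (0.97, 0.01),
          "postcode":   (0.90, 0.05)}

# Composite weight for a pair where all three fields agree; the
# match/clerical cutoffs are thresholds on this composite score.
composite = sum(field_weights(m, u)[0] for m, u in fields.values())
```

This also shows why a cutoff of 20 means different things for different passes: a pass comparing 7 fields simply has more terms in the sum than one comparing 3.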

These scores tend to be specific to the data you're matching. Data from other systems may be more or less populated, consistent, etc., and that will have an effect on what your matches look like.

3 - I would have thought that the frequency file relates to the data you are matching against, and how often you regenerate the frequency files depends on how volatile that data is.
Regardless, you would regenerate it separately from the match call itself.
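Conceptually, generating the frequency data is just counting how often each value occurs in the reference data, which is why it is done separately and only refreshed when that data changes. A toy illustration (the real Match Frequency stage output format is different; this only shows the idea that rare values carry more evidence):

```python
from collections import Counter

# Counting value frequencies over the reference data.
reference_surnames = ["SMITH", "SMITH", "SMITH", "COOPER", "FENIMORE"]
freq = Counter(reference_surnames)

# A rare value carries more evidence: two records agreeing on
# FENIMORE is far less likely by chance than two agreeing on SMITH,
# so the rare agreement earns a higher weight.
total = len(reference_surnames)
u_prob = {value: count / total for value, count in freq.items()}
```

In other words, the frequency file lets the matcher adjust its chance-agreement probability per value rather than using one flat figure per field.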
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54372
Location: Sydney, Australia
Points: 294928

Posted: Wed Jul 27, 2016 12:00 am

On a web service call, you would quite likely use a two-file match. Yes, you would need to generate frequencies for the tiny number of rows arriving from the web client, but that will not take very long ...

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Wed Jul 27, 2016 11:56 am

Thanks for the details, they helped a lot with my understanding!

Regarding the frequency file generation, I'm a bit confused, as we were just advised to keep a default file.

1. The default file is definitely not built from the entire source system data; it is generated for just one particular system. In that case, won't it affect the match specification output for other systems?
"For every match input data set, its corresponding frequency file should be generated" - is that not true?


Thanks
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54372
Location: Sydney, Australia
Points: 294928

Posted: Wed Jul 27, 2016 3:31 pm

The key word is "should". QualityStage will work without accurate match frequencies, but the probabilities that it uses will not be as accurate. You can certainly generate a standard set of frequencies ...

dj
Participant



Joined: 24 Aug 2006
Posts: 78
Location: india
Points: 1101

Posted: Thu Aug 11, 2016 4:30 am

Thanks Ray!

In the tutorial example for the two-source reference match, after running the passes, the reference data and the source data records all fall under MB.

In a single-source unduplicate match, records are flagged as match and duplicate within the set.

But does all the source data being MB mean that all the candidate records match the transaction record? Are they not duplicates?

In the example below, even the records with a missing address, though they got a low weight, are of type MB
(maybe the specification is working like that).

But in a two-source reference match, when does a record fall under the duplicate type?

What do match (MB) and duplicate (DB) mean here?




Code:
      MA   T   TC001-T001   James   Fenimore   Cooper   1L Mohicans Run   Apartment 19C      Morgantown   WV   26505-0019   US
300   MB   C   TC001-C001   James   Fenimore   Cooper   1L Mohicans Run   Apartment 19C      Morgantown   WV   26505-0019   US
232   MB   C   TC001-C002   James   Fenimore   Cooper   P.O. Box 19C         Morgantown   WV   26505   US
232   MB   C   TC001-C003   James   Fenimore   Cooper            Morgantown   WV   26505   US
200   MB   C   TC001-C004   James   Fenimore   Cooper                     
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Mon Aug 15, 2016 6:21 am

Many questions in the same thread, but let me try to help...

Your last example does not have any DB records, nor does it show the critical information needed to answer your matching question. My personal common practice is to output set, type, pass, weight, and the original data. You can then sort to see things in the right order. For two-file matches, you have both the reference and the data on the same line.

In a two-file match, you will see DB records when you have selected 'many-to-one' match type. Please see: http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.qs.ug.doc/topics/c_About_record_linkage.html
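As one plausible reading of that duplicate handling, here is a hypothetical sketch: when several data records link to the same reference record and duplicate flagging is in effect, the best-scoring pair is kept as the match and the remaining pairs are flagged as duplicates. The IDs below echo the tutorial sample; the flagging logic itself is a simplification, not the product's actual algorithm:

```python
# (data_id, reference_id, weight) pairs after scoring; IDs are from
# the tutorial sample, weights illustrative.
pairs = [
    ("TC001-C001", "TC001-T001", 300),
    ("TC001-C002", "TC001-T001", 232),
    ("TC001-C004", "TC001-T001", 200),
]

# Group candidate data records by the reference record they hit.
by_ref = {}
for data_id, ref_id, weight in pairs:
    by_ref.setdefault(ref_id, []).append((weight, data_id))

# Best-scoring candidate becomes the match; the rest become duplicates.
flags = {}
for ref_id, cands in by_ref.items():
    cands.sort(reverse=True)      # highest weight first
    flags[cands[0][1]] = "M"      # match (MA/MB in the report)
    for _, data_id in cands[1:]:
        flags[data_id] = "D"      # duplicate (DA/DB in the report)
```

Under this reading, dj's sample output shows all MB rows because the pass in question was not flagging lower-scoring candidates as duplicates.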

Another word about frequencies for real-time jobs: my common practice is to never generate the frequencies from the incoming data - it is typically one record. Instead, use the frequencies from the last full-volume run/generation. This will give you a more accurate match.

_________________
Regards,
Robert