two-source probabilistic matching in real time

Infosphere's Quality Product


qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

two-source probabilistic matching in real time

Post by qt_ky »

Is it possible to perform two-source probabilistic matching in real time with:

- data source: real-time request, one record at a time, via web service call to a job we deploy as an ISD application

- reference source: Oracle (akin to an Oracle sparse lookup, knowing that the reference records can be inserted and updated in real time by external processes)

Q1. It seems from the documentation that all the reference data first needs to be standardized, have its frequency distribution generated, and so on. That seems like a lot of overhead to run through for each and every request that needs to be matched in real time. Is that a correct understanding? I would hope not.

Q2. I had gathered from the 8.7 docs that all the match stage inputs must be persistent data sets. Again I hope I misunderstood, as that does not really make sense. In the 11.3.1 docs it says database stages can be inputs. Have the match input requirements loosened between 8.7 and 11.3.1?

The latter could be easily tested, whereas Q1 seems pretty involved. At this point it would help to clarify what is actually possible in real time when Oracle holds the ever-changing reference data.
Choose a job you love, and you will never have to work a day in your life. - Confucius
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Q1. True, but you can avoid the per-request overhead. Store the selected standardized values in the reference table as additional columns, so nothing has to be re-standardized at match time. After the reference table is updated you will need to regenerate the frequencies, but that can be done "off line".
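To make the Q1 suggestion concrete, here is a minimal sketch in Python. It is not QualityStage code: `sqlite3` stands in for Oracle, and `standardize`, `upsert_reference`, `regenerate_frequencies`, `lookup_candidates` and the `reference` table layout are all hypothetical names invented for illustration. It only shows the division of work: standardize once on the write path, rebuild frequencies in a batch step, and standardize just the single incoming record at request time.

```python
import sqlite3
from collections import Counter


def standardize(name: str) -> str:
    """Hypothetical stand-in for a Standardize rule set:
    uppercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def upsert_reference(conn, rec_id, raw_name):
    """Write path: store the standardized value next to the raw value,
    so the reference data never has to be re-standardized per request."""
    conn.execute(
        "INSERT OR REPLACE INTO reference (id, raw_name, std_name) VALUES (?, ?, ?)",
        (rec_id, raw_name, standardize(raw_name)),
    )


def regenerate_frequencies(conn):
    """Offline batch step: rebuild value frequencies from the stored
    standardized column (assumed to run on a schedule, not per request)."""
    rows = conn.execute("SELECT std_name FROM reference").fetchall()
    return Counter(row[0] for row in rows)


def lookup_candidates(conn, raw_request_name):
    """Request path: standardize only the one incoming record, then fetch
    candidate reference rows by the stored standardized key."""
    key = standardize(raw_request_name)
    cursor = conn.execute(
        "SELECT id FROM reference WHERE std_name = ? ORDER BY id", (key,)
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")  # stand-in for the Oracle reference table
conn.execute(
    "CREATE TABLE reference (id INTEGER PRIMARY KEY, raw_name TEXT, std_name TEXT)"
)
upsert_reference(conn, 1, "Smith,  John")
upsert_reference(conn, 2, "smith john")
upsert_reference(conn, 3, "Doe, Jane")

freqs = regenerate_frequencies(conn)
candidates = lookup_candidates(conn, "SMITH JOHN.")
print(freqs)
print(candidates)
```

The exact-key lookup here only retrieves candidates, much like a blocking step; the probabilistic scoring itself would still be done by the Match stage using the frequencies generated offline.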

Q2. The source only needs to be a persistent Data Set for the QS Match Designer. In a QS job, the inputs to the Match stage can be virtual Data Sets (that is, they can come in on the input links from anywhere). That said, a common practice is to load Data Sets in prior jobs and run just the match in a job of its own - after all, match is a very resource-hungry operation.
Last edited by ray.wurlod on Fri Jan 09, 2015 8:12 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
qt_ky

Post by qt_ky »

Great suggestion! Thank you.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Regarding your second question, you can use any kind of input you desire in your job. In fact, a Database stage is very common in your scenario.

Data Sets are required only when you use the Match Designer.

This has been the case since 8.0.
Regards,
Robert
qt_ky

Post by qt_ky »

Thanks for clarifying!!