two-source probabilistic matching in real time

Posted: Thu Jan 08, 2015 10:58 am
by qt_ky
Is it possible to perform two-source probabilistic matching in real time with:

- data source: real-time request, one record at a time, via web service call to a job we deploy as an ISD application

- reference source: Oracle (akin to an Oracle sparse lookup, knowing that the reference records can be inserted and updated in real time by external processes)

Q1. It seems from the documentation that all the reference data first needs to be standardized, have its frequency distribution generated, etc. That seems like a lot of overhead to run through for each and every request that needs to be matched in real time. Is that a correct understanding? I would hope not.

Q2. I had gathered from the 8.7 docs that all the match stage inputs must be persistent data sets. Again I hope I misunderstood, as that does not really make sense. In the 11.3.1 docs it says database stages can be inputs. Have the match input requirements loosened between 8.7 and 11.3.1?

The latter could be easily tested, whereas Q1 seems pretty involved. At this point it would help to clarify what is actually possible in real time when Oracle holds the ever-changing reference data.

Posted: Thu Jan 08, 2015 3:58 pm
by ray.wurlod
Q1. True, but you don't have to do it per request. Store the selected standardized values in the reference table as additional columns, so the reference data never has to be re-standardized at match time. After the reference table is updated you will need to regenerate the frequencies, but that can be done "off line".
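To make the Q1 suggestion concrete, here is a minimal sketch in Python. It is not QualityStage code: an in-memory SQLite table stands in for the Oracle reference table, and the `standardize` rule, the `name_std` column, and all function names are hypothetical. The exact-match lookup only illustrates fetching candidates by the stored standardized value; it is not the probabilistic match itself.

```python
import re
import sqlite3
from collections import Counter


def standardize(value: str) -> str:
    """Toy standardization (assumed rule): uppercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", value.upper())
    return " ".join(cleaned.split())


def load_reference(conn, rows):
    """Store each raw value together with its standardized form in an extra column."""
    conn.execute(
        "CREATE TABLE reference (id INTEGER PRIMARY KEY, name TEXT, name_std TEXT)"
    )
    conn.executemany(
        "INSERT INTO reference (id, name, name_std) VALUES (?, ?, ?)",
        [(rid, name, standardize(name)) for rid, name in rows],
    )


def regenerate_frequencies(conn):
    """Offline step: recount how often each standardized value occurs."""
    counts = Counter(row[0] for row in conn.execute("SELECT name_std FROM reference"))
    return dict(counts)


def candidates_for(conn, incoming_name):
    """Real-time step: only the single incoming record is standardized."""
    key = standardize(incoming_name)
    cur = conn.execute(
        "SELECT id FROM reference WHERE name_std = ? ORDER BY id", (key,)
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")  # stand-in for the Oracle reference source
load_reference(conn, [(1, "Smith, John"), (2, "smith john"), (3, "Doe, Jane")])
frequencies = regenerate_frequencies(conn)   # run on a schedule, not per request
matches = candidates_for(conn, "SMITH  JOHN.")  # run per web service request
```

The point of the split is that `regenerate_frequencies` can run as a batch job after reference updates, while the per-request path touches only the one incoming record and the precomputed columns.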

Q2. Source only needs to be a persistent Data Set for the QS Match Designer. In a QS job, inputs to the Match stage can be a virtual Data Set (that is, come in on the input links from anywhere), although a common practice is to load Data Sets in prior jobs and run just the match in a job of its own - after all, match is a very resource-hungry operation.

Posted: Fri Jan 09, 2015 7:01 am
by qt_ky
Great suggestion! Thank you.

Posted: Fri Jan 09, 2015 5:04 pm
by rjdickson
Regarding your second question, you can use any kind of input you desire in your job. In fact, a Database stage is very common in your scenario.

Datasets are required when you use the Match Designer.

This has been the case since 8.0.

Posted: Fri Jan 09, 2015 9:46 pm
by qt_ky
Thanks for clarifying!!