Reference Match Input Considerations

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Reference Match Input Considerations

Post by ymadden@bmi.com »

We're putting together our first Quality Stage match job and using product type data. For p.o.c. purposes, we've been using datasets as standardization/frequency output and ref. match input. During the match job, it seems that the match stage is waiting for the entire reference file to be read into memory before doing any matching. Is this really the case? If we were using a database as ref match input, would it attempt to read the entire reference table into memory before blocking and matching?
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

When you created the match frequency job, did you specify your match spec in the match frequency stage?
If you did, it only gets match frequencies for the fields you are matching on.
If you didn't, it gets match frequency data on all fields (including the primary key) so your match frequency data will be largely overhead.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

Yeah. The RefFreq input is pretty small (<200). But our entire Reference input will eventually be on the order of several 100 Million rows.

When the match runs with a database connector as reference input -
Does it (select * from reference_table) to a temp dataset or memory or something and then block and match against that?
Or does it (select * from reference_table where block_conditions_are_met) into memory/dataset and match against that?
boxtoby
Premium Member
Premium Member
Posts: 138
Joined: Mon Mar 13, 2006 5:11 pm
Location: UK

Post by boxtoby »

When performing reference matches it is a good idea to reduce the amount of data in the reference set, if possible. I don't know your data but, for example, if you were using address data there is no point in loading reference addresses for London when the source only contains addresses from Edinburgh, Swindon and Bristol.

Hope that makes sense.

Cheers,
Bob.
Bob Oxtoby
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

ymadden@bmi.com wrote:Yeah. The RefFreq input is pretty small (<200). But our entire Reference input will eventually be on the order of several 100 Million rows.

When the match runs with a database connector as reference input -
Does it (select * from reference_table) to a temp dataset or memory or something and then block and match against that?
Or does it (select * from reference_table where block_conditions_are_met) into memory/dataset and match against that?
OK, now I get what you're after.
By default, it will load your whole reference set. Why wouldn't it? You're the one telling the job what to do, after all. It also needs to do it in advance: it doesn't want to miss potential matches because the reference set is incomplete.

If you want to limit the size of the reference set, you would need to filter the reference set yourself. IIRC, this is pretty common for SOA reference match jobs, so you should be able to pick up some tips from the ISD guides.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Also, if you are using columns in the match spec with large cardinality, like a social security number or a location id, go to Configure Specification ->Variable Special Handling in the match designer, and set the NOFREQ action to those fields.
That will exclude them from the match frequency generation and greatly reduce the size of the match frequency file.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

Thanks guys, that all makes sense.

As a follow up - We're generating several match key columns through a standardization job that are solely used for blocking and matching (NYSIIS, Acronyms, tokens, etc.). Should these match keys be persisted in the database alongside the data, or should these be generated in flight as the reference data is read from the database?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Persist them in the database table, and index them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

But what's the point of indexing them if QS does not block and match against the database?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

A reference match can be (and often is) against the database.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

Our plan is to run the match against the database. So this brings me back to my original question - Assume I run a match job with a DB Connector as the reference input. Does the match job issue one single select statement at the beginning of the run to pull the entire reference set into scope for blocking and matching? Or does the job issue a separate select statement (with blocking conditions in a where clause) for each record in the data set?
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

And if you read my response, you'd know that it reads it all, unless you filter it yourself.
Like I also said, look at the guides for Internet Services Director and Real Time jobs, and you will see how they do exactly what you were asking for.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

Point taken. Thanks. We still have a lot to learn about this platform.
Post Reply