Reference Match Input Considerations
-
- Premium Member
- Posts: 21
- Joined: Fri Feb 01, 2013 9:13 am
We're putting together our first QualityStage match job, using product-type data. For proof-of-concept purposes, we've been using datasets as the standardization/frequency output and as the reference match input. During the match job, it seems the Match stage waits for the entire reference file to be read into memory before doing any matching. Is that really the case? If we were using a database as the reference match input, would it read the entire reference table into memory before blocking and matching?
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
When you created the match frequency job, did you specify your match spec in the match frequency stage?
If you did, it only gets match frequencies for the fields you are matching on.
If you didn't, it gets match frequency data on all fields (including the primary key) so your match frequency data will be largely overhead.
-
- Premium Member
- Posts: 21
- Joined: Fri Feb 01, 2013 9:13 am
Yeah. The RefFreq input is pretty small (<200 rows). But our entire reference input will eventually be on the order of several hundred million rows.
When the match runs with a database connector as reference input:
Does it (select * from reference_table) into a temp dataset or memory or something and then block and match against that?
Or does it (select * from reference_table where block_conditions_are_met) into memory/dataset and match against that?
When performing reference matches it is a good idea to reduce the amount of data in the reference set, if possible. I don't know your data but, for example, if you were using address data there is no point in loading reference addresses for London when the source only contains addresses from Edinburgh, Swindon and Bristol.
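The reduction described above can be done before the match job ever runs. As a rough illustration (plain Python, nothing to do with QualityStage internals; the city column and row layout are hypothetical), you would keep only reference rows whose blocking value actually occurs in the source:

```python
# Sketch: trim the reference set down to blocking values that actually
# occur in the source data. Column names here are hypothetical.
def trim_reference(source_rows, reference_rows, block_col):
    # Collect the blocking values present in the source.
    source_blocks = {row[block_col] for row in source_rows}
    # Keep only reference rows that could possibly block with a source row.
    return [row for row in reference_rows if row[block_col] in source_blocks]

source = [{"city": "Edinburgh"}, {"city": "Swindon"}, {"city": "Bristol"}]
reference = [{"city": "London", "id": 1}, {"city": "Bristol", "id": 2}]

trimmed = trim_reference(source, reference, "city")
print(trimmed)  # the London reference row is dropped
```

In practice this filter would be a where clause or a join when extracting the reference data, but the principle is the same: no source row can ever block with a London reference row, so there is no point loading it.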
Hope that makes sense.
Cheers,
Bob.
Bob Oxtoby
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
OK, now I get what you're after.

ymadden@bmi.com wrote:
Yeah. The RefFreq input is pretty small (<200 rows). But our entire reference input will eventually be on the order of several hundred million rows.
When the match runs with a database connector as reference input:
Does it (select * from reference_table) into a temp dataset or memory or something and then block and match against that?
Or does it (select * from reference_table where block_conditions_are_met) into memory/dataset and match against that?
By default, it will load your whole reference set. Why wouldn't it? You're the one telling the job what to do, after all. It also needs to do it in advance: it doesn't want to miss potential matches because the reference set is incomplete.
If you want to limit the size of the reference set, you would need to filter the reference set yourself. IIRC, this is pretty common for SOA reference match jobs, so you should be able to pick up some tips from the ISD guides.
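To see why the stage wants the full reference set up front, here is a minimal sketch of the block-and-match pattern. This is plain Python, not QualityStage internals, and the column names, scoring function, and threshold are all hypothetical: the whole reference set is indexed by block key first, then each source record is scored only against reference records in its own block.

```python
from collections import defaultdict

def block_and_match(source_rows, reference_rows, block_key, score_fn, threshold):
    # Pass 1: index the ENTIRE reference set by blocking key.
    # If the reference set were incomplete at this point, matches
    # would silently be missed -- hence the up-front load.
    index = defaultdict(list)
    for ref in reference_rows:
        index[ref[block_key]].append(ref)

    # Pass 2: score each source row only against its own block.
    matches = []
    for src in source_rows:
        for ref in index.get(src[block_key], []):
            if score_fn(src, ref) >= threshold:
                matches.append((src, ref))
    return matches

# Toy scoring function: exact name agreement scores 1, else 0.
score = lambda a, b: 1 if a["name"] == b["name"] else 0

src = [{"zip": "2000", "name": "SMITH"}]
ref = [{"zip": "2000", "name": "SMITH"}, {"zip": "3000", "name": "SMITH"}]
print(block_and_match(src, ref, "zip", score, 1))
```

The second reference row never gets compared at all, because it lives in a different block; blocking is what keeps a several-hundred-million-row reference set from turning into a full cross product.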
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Also, if you are using columns with large cardinality in the match spec, such as a social security number or a location id, go to Configure Specification -> Variable Special Handling in the Match Designer and set the NOFREQ action on those fields.
That will exclude them from the match frequency generation and greatly reduce the size of the match frequency file.
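The effect of excluding a column from frequency generation can be illustrated with a rough sketch (plain Python, hypothetical column names): frequency counts are built per column, and NOFREQ-style columns are simply skipped, which shrinks the output dramatically when nearly every value is unique.

```python
from collections import Counter

def match_frequencies(rows, columns, nofreq=()):
    # Build value-frequency counts per column, skipping any column
    # flagged NOFREQ (e.g. SSNs or location ids, where almost every
    # value is unique and the counts are pure overhead).
    return {col: Counter(row[col] for row in rows)
            for col in columns if col not in nofreq}

rows = [{"ssn": "111", "city": "Melbourne"},
        {"ssn": "222", "city": "Melbourne"},
        {"ssn": "333", "city": "Sydney"}]

freqs = match_frequencies(rows, ["ssn", "city"], nofreq=("ssn",))
print(freqs)  # only the 'city' counts survive
```

With the SSN column excluded, the frequency output carries three city counts instead of one entry per distinct SSN, which is the kind of saving the NOFREQ action buys on real data.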
-
- Premium Member
- Posts: 21
- Joined: Fri Feb 01, 2013 9:13 am
Thanks guys, that all makes sense.
As a follow-up: we're generating several match-key columns in a standardization job that are used solely for blocking and matching (NYSIIS, acronyms, tokens, etc.). Should these match keys be persisted in the database alongside the data, or generated in flight as the reference data is read from the database?
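The in-flight option amounts to applying a transform as reference rows stream in, which can be sketched as below. The key function here is a deliberately trivial stand-in, NOT the real NYSIIS algorithm, and the column names are hypothetical:

```python
def with_match_keys(rows, key_fn):
    # Derive match-key columns in flight, as each reference row is read,
    # instead of persisting them alongside the data.
    for row in rows:
        yield {**row, "name_key": key_fn(row["name"])}

def toy_key(name):
    # Trivial stand-in for a phonetic key (NOT real NYSIIS):
    # uppercase, then drop non-initial vowels.
    name = name.upper()
    return name[0] + "".join(c for c in name[1:] if c not in "AEIOU")

rows = [{"name": "Madden"}]
print(list(with_match_keys(rows, toy_key)))  # name_key becomes "MDDN"
```

The trade-off is the usual one: persisting the keys pays the CPU cost once at load time and keeps match runs cheap, while generating in flight avoids extra columns in the database but recomputes the keys on every run over the full reference set.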
-
- Premium Member
- Posts: 21
- Joined: Fri Feb 01, 2013 9:13 am
Our plan is to run the match against the database. So this brings me back to my original question - Assume I run a match job with a DB Connector as the reference input. Does the match job issue one single select statement at the beginning of the run to pull the entire reference set into scope for blocking and matching? Or does the job issue a separate select statement (with blocking conditions in a where clause) for each record in the data set?