Match Frequency

Infosphere's Quality Product

Moderators: chulett, rschirm

samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Match Frequency

Post by samyamkrishna »

Hi All,

One of our ETL batches runs for 8 hours.
It contains 6 Match jobs.

The jobs use Unduplicate match, and the matches are for Individual, Org, Address, Phone, etc.
Each of them runs for more than an hour.

However, the frequency file is generated from a Row Generator with 1 row containing all the columns that are required for all six Match jobs.

My question is:

If we create actual frequency files from the data instead of from a Row Generator,
the frequency file will have more detail than the present frequency file.

Will this help improve the performance of the match jobs, since they would use actual data rather than a dummy frequency file?

Regards,
Samyam
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Define what you mean by "performance" in this context. Certainly generating frequencies will generate more accurate results (for a large enough sample) than an artificially flat frequency distribution.
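To illustrate why real frequencies affect result quality, here is a minimal Python sketch of the general idea behind frequency-based match weights in probabilistic matching. It is not QualityStage's internal calculation. The function name, the sample surnames and the `m_prob` value are all assumptions made up for the example.

```python
import math
from collections import Counter

def agreement_weights(values, m_prob=0.9):
    """Return an agreement weight log2(m / u) per value.

    u is approximated by the value's relative frequency in the data,
    so rare values earn a higher weight when two records agree on them.
    m_prob is an assumed probability that true matches agree on the field.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {v: math.log2(m_prob / (c / total)) for v, c in counts.items()}

# Hypothetical surname column: two common values and one rare value.
surnames = ["SMITH"] * 50 + ["NGUYEN"] * 30 + ["WURLOD"] * 1
weights = agreement_weights(surnames)

for name, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {weight:6.2f}")
```

With a one-row dummy frequency file there is nothing to tell a common value from a rare one, so agreement on "SMITH" would count as much as agreement on "WURLOD".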
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Hi Ray,

I bought the premium membership yesterday, 26th Nov.
But I am still not able to see the content you posted under premium.

I got a mail from Rick stating that I will get another mail of confirmation.
How long do you think it will take to get this membership?

Should I also contact editor@, like in one of the recent posts?

Regards,
Samyam
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Wait till the weekend is over.

It won't do any harm to contact editor@dsxchange.net, but these people actually have a life as well as running DSXchange.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Define "performance". Number of matches, "quality" of matches (lower false positives / negatives), execution time?

As Ray said, the quality of the match scores may improve a little by using accurate frequency data.

As for execution time, the number of records and number of match passes would be the first things to look at for understanding how much time is reasonable to expect it to take, and where you might be able to improve the times.
Also note that it takes time to create match frequency files.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

Take a look at your match specification. If you are using overrides for every column, then generating frequencies will not matter as the overrides can take priority.

Is the original question based on curiosity, or are you having quality issues in your matching?
Regards,
Robert
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Stuart,

I am worried about the execution time.
Thanks for the hints on what to look at.

I will look at them and draw a conclusion.


rjdickson,

Yes, there are overrides, but I am not sure if they are set for all the columns.
I will check that too.

The question is not based on curiosity. We are having issues with the run times due to a short execution window in production.
Cheers,
Samyam
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA

Post by rjdickson »

The most frequent cause of bad match performance is (arguably) blocking fields that are too 'loose' (include too many candidate records).

Do you know which pass is causing issues? (Blocking fields are defined per pass.)
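To make the effect of loose blocking concrete, here is a rough Python sketch that counts how many record pairs one pass would have to compare for a given set of blocking columns. The column names and the generated records are hypothetical, and it only models the pair count, not how the Match stage actually processes blocks.

```python
from collections import Counter

def candidate_pairs(records, block_keys):
    """Count within-block record pairs for one pass blocked on block_keys."""
    blocks = Counter(tuple(r[k] for k in block_keys) for r in records)
    # A block of n records produces n * (n - 1) / 2 pairs to compare.
    return sum(n * (n - 1) // 2 for n in blocks.values())

# Hypothetical data: 1000 records, 10 postcodes, 100 surnames.
records = [
    {"postcode": f"{i % 10:04d}", "surname": f"NAME{i % 100}"}
    for i in range(1000)
]

loose = candidate_pairs(records, ["postcode"])
tight = candidate_pairs(records, ["postcode", "surname"])
print("pairs, blocking on postcode only:   ", loose)
print("pairs, blocking on postcode+surname:", tight)
```

Because the pair count grows with the square of the block size, a few oversized blocks can dominate the run time of a pass.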

The next thing you can look at is the job design. I would assume there is some sort of read from a database for the reference link. Does that read have a 'where' clause, and if so, are the columns used in the where clause indexed?
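One way to check whether a 'where' clause can use an index is to look at the query plan. The sketch below uses an in-memory SQLite database purely as a stand-in, since the actual database is not known here. The table, column and index names are made up, and other databases have their own explain syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO customer (region, name) VALUES (?, ?)",
    [(f"R{i % 20}", f"NAME{i}") for i in range(2000)],
)

def plan(sql):
    """Return the detail text of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT id, name FROM customer WHERE region = 'R7'"

before = plan(query)  # no index on region yet: expect a full table scan
conn.execute("CREATE INDEX idx_customer_region ON customer (region)")
after = plan(query)   # the plan should now mention the new index

print("before:", before)
print("after: ", after)
```

If the plan still shows a full scan after the index exists, the filter column or the way the predicate is written would be the next thing to check.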
Regards,
Robert
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

I don't have access to Director on Prod.
I will try to get access.

I will post my findings once I get hold of the logs.
Cheers,
Samyam