DSXchange: DataStage and IBM Websphere Data Integration Forum
vijaydasari
Participant



Joined: 22 Jul 2007
Posts: 27

Points: 318

Posted: Tue Jul 02, 2013 12:00 pm

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
I have a customer dimension of 8 million records. The expected volume of customer data coming from the source is up to 1 million records (delta records).
I standardize both the customer_master data and the incoming source data before the match process, to identify matched records and residual records.
I load the entire customer dimension into a dataset, and the source customer data the same way.
The match process brings the whole volume of data into memory and compares records based on the block/match columns defined in the match specification. This process is taking a long time to complete, and sometimes I get out-of-memory errors.
Is there any way I can limit the customer_master data to bring in only potential match records?

Thanks
Vijay

_________________
Vijay
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Tue Jul 02, 2013 12:27 pm

Hi,

You should not be getting out-of-memory errors. What version are you on?

Having said that, by far the most common reason for long reference match runs is blocks that contain too many records. You can look at the match statistics in the log; for each pass you will see something like the following (your numbers will, of course, be different):
Code:
Maximum data block size (overflow blocks included) = 10
Average data block size (overflow blocks not included) = 1.455
Maximum reference block size (overflow blocks included) = 18
Average reference block size (overflow blocks not included) = 1.455


How big is your 'Maximum data block size' for each pass?
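To see why block size matters so much, note that within each block every data record is compared against every reference record, so the work per block grows with the product of the two block sizes. A minimal sketch (plain Python, not QualityStage internals; the counts below are made-up illustrations):

```python
# Illustration only: in a reference match, each data record in a block is
# compared against each reference record in the same block, so the
# comparisons for one block are roughly data_size * ref_size.
def comparisons(block_sizes):
    """block_sizes: list of (data_block_size, ref_block_size) pairs."""
    return sum(d * r for d, r in block_sizes)

# Same record totals, distributed evenly vs. concentrated in one block:
small = [(2, 4)] * 100    # 200 data / 400 reference records, 100 blocks
skewed = [(200, 400)]     # same totals, one oversized block
print(comparisons(small))   # 800
print(comparisons(skewed))  # 80000
```

This is why one block with thousands of reference records can dominate the runtime of a pass even when the average block size looks small.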

_________________
Regards,
Robert
vijaydasari
Participant



Joined: 22 Jul 2007
Posts: 27

Points: 318

Posted: Tue Jul 02, 2013 12:53 pm

I am using 8.5. Please see the log entries below.

My concern is that the reference match job runs for a long time. If I can reduce the customer dimension data volume, the job might complete faster.

------------------------------ Match Pass 1 ------------------------
Number of data input records read = 557537
Number of reference input records read = 5831981
Number of input blocks processed = 78970
Number of blocks that overflowed = 0
Maximum data block size (overflow blocks included) = 252
Average data block size (overflow blocks not included) = 1.312
Maximum reference block size (overflow blocks included) = 3060
Average reference block size (overflow blocks not included) = 3.672
Number of matched pairs found = 103624
Number of exact matched pairs found = 103624
Number of clerical pairs found = 0
Number of duplicate data records found = 0
Number of exact duplicate data records found = 0
Number of duplicate reference records found = 626354
Number of exact duplicate reference records found = 626354
Number of residual data records (including those with null block columns) = 453913
Number of residual reference records (including those with null block columns) = 5831981
Total number of comparisons performed = 729978
------------------------------ Match Pass 2 ------------------------
Number of data input records read = 453913
Number of reference input records read = 5831981
Number of input blocks processed = 2833
Number of blocks that overflowed = 0
Maximum data block size (overflow blocks included) = 238
Average data block size (overflow blocks not included) = 3.371
Maximum reference block size (overflow blocks included) = 1885
Average reference block size (overflow blocks not included) = 23.232
Number of matched pairs found = 9550
Number of exact matched pairs found = 9550
Number of clerical pairs found = 0
Number of duplicate data records found = 0
Number of exact duplicate data records found = 0
Number of duplicate reference records found = 727647
Number of exact duplicate reference records found = 727647
Number of residual data records (including those with null block columns) = 444363
Number of residual reference records (including those with null block columns) = 5831981
Total number of comparisons performed = 737197

_________________
Vijay
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Tue Jul 02, 2013 2:02 pm

The numbers look reasonable.

You can try to reduce the reference data, but the time it takes to do that will likely offset any savings in the match job itself. If you want to experiment, select only the reference records whose values match your blocking criteria in the source data, then use that subset as the reference input to your match.
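The pre-filtering idea can be sketched outside DataStage as follows (a minimal illustration with hypothetical column names; in practice the blocking key would be built from your match specification's block columns, and the filter could equally be a lookup or join stage upstream of the match):

```python
# Keep only reference records whose blocking key also occurs in the
# source (delta) data, so the match never sees unreachable reference rows.
def filter_reference(source_rows, reference_rows, key):
    """key: function extracting the blocking-key value from a row."""
    source_keys = {key(row) for row in source_rows}
    return [row for row in reference_rows if key(row) in source_keys]

# Hypothetical rows blocked on a postcode column:
source = [{"zip": "60601", "name": "A"}, {"zip": "60602", "name": "B"}]
reference = [
    {"zip": "60601", "name": "X"},
    {"zip": "99999", "name": "Y"},  # no source record blocks to 99999
]
kept = filter_reference(source, reference, key=lambda r: r["zip"])
print(len(kept))  # 1
```

Note that if your match specification has multiple passes with different blocking columns, the filter must keep any reference record that qualifies under at least one pass, or you will lose matches in the later passes.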

You can also try, if your system resources and license allow it, running with more nodes.

_________________
Regards,
Robert
