DSXchange: DataStage and IBM Websphere Data Integration Forum
vijaydasari
Participant



Joined: 22 Jul 2007
Posts: 27

Points: 318

Posted: Tue Jul 02, 2013 12:00 pm

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
I have a customer dimension of 8 million records. The expected volume of customer data coming from the source is up to 1 million records (delta records).
I standardize both the customer_master data and the incoming source data before the match process, to identify matched records and residual records.
I load the entire customer dimension into a dataset, and the source customer data the same way.
The match process brings the whole volume of data into memory and compares records based on the block/match columns defined in the match specification. This process is taking a long time to complete, and sometimes I get out-of-memory errors.
Is there any way I can limit the customer_master data to bring in only potential match records?

Thanks
Vijay

_________________
Vijay
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Tue Jul 02, 2013 12:27 pm

Hi,

You should not be getting out-of-memory errors. What version are you on?

Having said that, by far the most common reason for long reference match runs is blocks that contain too many records. You can look at the match statistics in the log; for each pass you will see something like the following (your numbers will, of course, be different):
Code:
Maximum data block size (overflow blocks included) = 10
Average data block size (overflow blocks not included) = 1.455
Maximum reference block size (overflow blocks included) = 18
Average reference block size (overflow blocks not included) = 1.455


How big is your 'Maximum data block size' for each pass?
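To see why block size matters so much, note that within each block every data record is compared against every reference record, so the work per block grows with the product of the two block sizes. A minimal sketch (plain Python, not QualityStage internals; the counts below are made-up illustrations):

```python
# Illustration only: in a reference match, each data record in a block is
# compared against each reference record in the same block, so the
# comparisons for one block are roughly data_size * ref_size.
def comparisons(block_sizes):
    """block_sizes: list of (data_block_size, ref_block_size) pairs."""
    return sum(d * r for d, r in block_sizes)

# Same record totals, distributed evenly vs. concentrated in one block:
small = [(2, 4)] * 100    # 200 data / 400 reference records, 100 blocks
skewed = [(200, 400)]     # same totals, one oversized block
print(comparisons(small))   # 800
print(comparisons(skewed))  # 80000
```

This is why one block with thousands of reference records can dominate the runtime of a pass even when the average block size looks small.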

_________________
Regards,
Robert
vijaydasari
Participant



Joined: 22 Jul 2007
Posts: 27

Points: 318

Posted: Tue Jul 02, 2013 12:53 pm

I am using 8.5. Please see the log entries below.

My concern is that the reference match job runs for a long time. If I can reduce the customer dimension data volume, the job might complete faster.

------------------------------ Match Pass 1 ------------------------
Number of data input records read = 557537
Number of reference input records read = 5831981
Number of input blocks processed = 78970
Number of blocks that overflowed = 0
Maximum data block size (overflow blocks included) = 252
Average data block size (overflow blocks not included) = 1.312
Maximum reference block size (overflow blocks included) = 3060
Average reference block size (overflow blocks not included) = 3.672
Number of matched pairs found = 103624
Number of exact matched pairs found = 103624
Number of clerical pairs found = 0
Number of duplicate data records found = 0
Number of exact duplicate data records found = 0
Number of duplicate reference records found = 626354
Number of exact duplicate reference records found = 626354
Number of residual data records (including those with null block columns) = 453913
Number of residual reference records (including those with null block columns) = 5831981
Total number of comparisons performed = 729978
------------------------------ Match Pass 2 ------------------------
Number of data input records read = 453913
Number of reference input records read = 5831981
Number of input blocks processed = 2833
Number of blocks that overflowed = 0
Maximum data block size (overflow blocks included) = 238
Average data block size (overflow blocks not included) = 3.371
Maximum reference block size (overflow blocks included) = 1885
Average reference block size (overflow blocks not included) = 23.232
Number of matched pairs found = 9550
Number of exact matched pairs found = 9550
Number of clerical pairs found = 0
Number of duplicate data records found = 0
Number of exact duplicate data records found = 0
Number of duplicate reference records found = 727647
Number of exact duplicate reference records found = 727647
Number of residual data records (including those with null block columns) = 444363
Number of residual reference records (including those with null block columns) = 5831981
Total number of comparisons performed = 737197

_________________
Vijay
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Posted: Tue Jul 02, 2013 2:02 pm

The numbers look reasonable.

You can try to reduce the reference data, but the time it takes to do that will likely offset any savings in the match job itself. If you want to experiment, select only the reference records whose values match your blocking criteria in the source data, then use that subset as the reference input to your match.
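The pre-filtering idea can be sketched outside DataStage as follows (a minimal illustration with hypothetical column names; in practice the blocking key would be built from your match specification's block columns, and the filter could equally be a lookup or join stage upstream of the match):

```python
# Keep only reference records whose blocking key also occurs in the
# source (delta) data, so the match never sees unreachable reference rows.
def filter_reference(source_rows, reference_rows, key):
    """key: function extracting the blocking-key value from a row."""
    source_keys = {key(row) for row in source_rows}
    return [row for row in reference_rows if key(row) in source_keys]

# Hypothetical rows blocked on a postcode column:
source = [{"zip": "60601", "name": "A"}, {"zip": "60602", "name": "B"}]
reference = [
    {"zip": "60601", "name": "X"},
    {"zip": "99999", "name": "Y"},  # no source record blocks to 99999
]
kept = filter_reference(source, reference, key=lambda r: r["zip"])
print(len(kept))  # 1
```

Note that if your match specification has multiple passes with different blocking columns, the filter must keep any reference record that qualifies under at least one pass, or you will lose matches in the later passes.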

You can also try, if your system resources and license allow it, running with more nodes.

_________________
Regards,
Robert
