NUMERIC Comparison Takes Too Long

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
asyafrudin
Participant
Posts: 16
Joined: Thu Oct 21, 2010 1:40 am
Location: Indonesia
Contact:

NUMERIC Comparison Takes Too Long

Post by asyafrudin »

Hi,

I have 2 match specifications, MS1 & MS2. Both match specifications use only 1 match pass and both are basically the same except for 2 match commands, MCA & MCB. MCA & MCB compares columns containing numeric codes, so I decided to use the NUMERIC comparison method for both match commands in MS1. Out of curiosity, I wanted to compare the result using NUMERIC with UNCERT so I decided to use the UNCERT comparison method for both match commands in MS2. So now I have MS1 with NUMERIC method for MCA & MCB and MS2 with UNCERT method for MCA & MCB.

I test both match spefications in an identical environment using the same data source. What's funny is that MS1 took 183 hours to finish, while MS2 only took 44 hours to finish. Does using the NUMERIC function really made it significantly longer for the matching process to finish or is it something else? Anyone care to shed some light on this matter? I've tried googling to no avail.
Perfection is not about making no mistakes. Perfection is about fixing your mistakes.
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

That is a lot of hours. How many records are involved?

Were the two methods run at the same time as each other?

Were there other jobs or a varying workload running during either test?

Are there a lot of other commands or steps? If so, it would help to create a test that only does that one step using NUMERIC and another that only does the one step using UNCERT. If you can run them on an identical sample set, such as 5%, then it won't take nearly as long to run several more comparisons.
Choose a job you love, and you will never have to work a day in your life. - Confucius
asyafrudin
Participant
Posts: 16
Joined: Thu Oct 21, 2010 1:40 am
Location: Indonesia
Contact:

Post by asyafrudin »

Hi, qt_ky. Thanks for the reply.

I'm using 6,136,023 data records and 25,757,890 reference records. I'm running both methods separately and I made sure there are no other jobs running. I'm using a server dedicated for InfoSphere, so I don't think there are any other apps or processes with significant workload running in the background.

The job only has on two-source match stage with typical dataset input and output. I am planning to isolate the NUMERIC comparison method and test it with a smaller sample as you suggested, but couldn't find the time to do that. Nevertheless, I'm still confused to why two NUMERIC match commands could make such a huge difference in running time. Any other pointers aside from the "smaller" test?
Perfection is not about making no mistakes. Perfection is about fixing your mistakes.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Can you take a look at the job log to see whether any particular process takes a very long time? Perhaps enable some of the reporting environment variables (especially APT_PLAYER_TIMING)?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Hi,

I wonder if the time is being spent in converting strings to numerics (ref: http://www-01.ibm.com/support/knowledge ... ison_.html)

I agree that is a lot of time, though.

Is your interest academic, or do you have a specific matching problem you are trying to solve? UNCERT is typically used on strings, and CNT_DIFF is designed for numbers (ref: http://www-01.ibm.com/support/knowledge ... isons.html)
Regards,
Robert
asyafrudin
Participant
Posts: 16
Joined: Thu Oct 21, 2010 1:40 am
Location: Indonesia
Contact:

Post by asyafrudin »

ray.wurlod wrote:Can you take a look at the job log to see whether any particular process takes a very long time?
I'll give it a shot perhaps some time next week. I'll post the results here when I have them.
Perfection is not about making no mistakes. Perfection is about fixing your mistakes.
asyafrudin
Participant
Posts: 16
Joined: Thu Oct 21, 2010 1:40 am
Location: Indonesia
Contact:

Post by asyafrudin »

rjdickson wrote:I wonder if the time is being spent in converting strings to numerics
I was also wondering about the same thing, however I don't think it made any sense that the job took so long just to convert strings to numbers. This made me really curious about the conversion process, but I can't find any information about it. Perhaps because it's proprietary?

The columns in MCA and MCB contains strings, but the strings are just numbers padded with zeros in front of them. What's annoying is that there's not standard regarding the amount of padded zeros. It's like someone can enter "001", "01", or just "1". All three of these examples are the same thing. That's why I'm using NUMERIC. I was hoping that NUMERIC would remove those padded zeros before making the comparison. I don't think CNT_DIFF is suitable for my case.
Perfection is not about making no mistakes. Perfection is about fixing your mistakes.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Hi,

Thanks for the explanation of the problem - it helps a lot!

You are correct that CNT_DIFF will not help. UNCERT is not designed for this issue either.

You are also correct in that NUMERIC will strip leading zeros. At least it did in my quick tests :)

One option is to use other stages to remove the leading zeros before the match. The two that come to mind immediately are in the Pattern Action Language (if this field is part of a Standardization), or a Transformer.

You can also open a PMR to IBM, but they will likely want to see your data to reproduce the test, so you may need to reduce the test to 'just' this numeric compare.
Regards,
Robert
Post Reply