Credibility of Checksum and Alternatives

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Credibility of Checksum and Alternatives

Post by jerome_rajan »

We're trying to determine whether the incremental data provided by our client's partners is indeed reliable and, if so, to what degree. To achieve this, we are performing the exercise below:

1. Day 1 - Take a full extract of the source from the partner for Day 1
2. Day 2 - Take a full extract of the source from the partner for Day 2 and also accept the incremental feed that they normally provide (i.e. the change data between Day 1 and Day 2)
3. Do a column-by-column comparison to check for any changes between the two full extracts taken on Day 1 and Day 2, and compare the output with the incremental feed provided by the partner.

Now, the tables in question have volumes anywhere between 5 million and 60 million rows, and each table can have 40-50 columns. A column-by-column comparison does not seem like a very good idea, so I suggested the checksum strategy.
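To make the exercise concrete, here is a minimal Python sketch of the checksum idea outside DataStage; the file names, delimiter and single-column key are placeholders rather than the actual feed layout.

Code: Select all

# Minimal sketch: derive the change set from two full extracts via per-row
# checksums and compare it with the partner's incremental feed.
# File names and the single-column key are assumptions for illustration only.
import csv
import zlib

def row_checksums(path, key_cols=1):
    """Map primary key -> CRC32 of all non-key columns for one full extract."""
    checksums = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            key = tuple(row[:key_cols])
            payload = "|".join(row[key_cols:]).encode("utf-8")
            checksums[key] = zlib.crc32(payload)
    return checksums

day1 = row_checksums("full_extract_day1.csv")
day2 = row_checksums("full_extract_day2.csv")

# Changes derived from the two full extracts...
inserts = set(day2) - set(day1)
deletes = set(day1) - set(day2)
updates = {k for k in set(day1) & set(day2) if day1[k] != day2[k]}
derived = inserts | deletes | updates

# ...compared with the keys reported in the incremental feed (the key is
# assumed to be the first column there as well).
with open("incremental_feed_day2.csv", newline="") as f:
    reported = {tuple(row[:1]) for row in csv.reader(f)}

print("changes missing from the feed:", len(derived - reported))
print("changes reported but not observed:", len(reported - derived))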

My question: Is CRC32 reliable for the above-mentioned range of volumes? Is there a better alternative for achieving what we're trying to do here?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Jerome,

The important question is: what do you want to achieve by comparing the CRC32 values of two data rows instead of comparing the column values themselves?

You'd still have to read all rows and all columns from both extracts and generate the checksum for both, so you do not reduce the amount of data that has to be imported by the job. I do not doubt that CRC32 is sufficiently reliable when you keep the primary key of your data and generate the checksum over all other columns. But I doubt your job is going to run much faster.
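A tiny Python sketch of that point, with a made-up (key, columns) row layout: both approaches have to consume every column, only the comparison itself collapses to a single value.

Code: Select all

# Both approaches read every column; the checksum only collapses the compare.
# The row layout (key in position 0, then attributes) is invented for illustration.
import zlib

old_row = ("42", "Alice", "Hamburg", "2013-07-01")
new_row = ("42", "Alice", "Hamburg", "2013-07-02")

# Column-by-column: one equality test per non-key column.
changed_direct = any(a != b for a, b in zip(old_row[1:], new_row[1:]))

# Checksum: all non-key columns are still read to build the hash input,
# but only one 32-bit value per row is compared afterwards.
def crc(row):
    return zlib.crc32("|".join(row[1:]).encode("utf-8"))

changed_crc = crc(old_row) != crc(new_row)

print(changed_direct, changed_crc)  # True True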

DataStage is easily capable of handling 5 to 60 million rows in Change Capture when you're running it on an adequately sized Unix machine.
You say that you want to verify the reliability of your business partners. But you don't want to do that in a daily batch job, do you?
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

I thought that comparing a single checksum column would be more efficient than comparing each of the 40-50 columns. That is the idea behind using the checksum.

To answer your second question: this is not a daily batch job that is going to go live. It will be run every day for the next month to monitor the feeds and gather a large enough sample to reach a conclusion.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Jerome,

Yes, sure it is. I just doubt your performance gains will be all that significant in a job you are only going to use during a limited test period.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Assuming significant performance gains, is CRC32 reliable enough for the data volumes mentioned? Are there better alternatives to this approach?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I wouldn't assume anything (especially anything 'significant') but rather do a quick test to see what kind of speed you get from the CRC32 function and a single comparison vs. comparing all data columns. We're doing this using MD5 on sometimes a couple hundred columns (in another tool), mostly because it is easier than coding all of the checks and we don't have to worry about whether it takes 'a little longer' to do it this way. And your data volumes have nothing to do with the 'reliability' of this approach... just the speed. And the only way to find out what that speed is, is to code up some large-volume tests.

If you do this for real, make sure you store the checksum in the target. That way you only ever have to compute it once per occurrence.
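A hedged sketch of that idea in Python (the hash choice and the invented key/column values are illustrative only): compute the hash once when the row lands in the target, persist it, and on later runs hash only the incoming version of each row.

Code: Select all

# Store the checksum with the target row so it is computed once per occurrence;
# later runs only hash the incoming version. Values are invented for illustration.
import hashlib

def row_md5(columns):
    """MD5 over the concatenated non-key column values (32 hex characters)."""
    return hashlib.md5("|".join(columns).encode("utf-8")).hexdigest()

# Initial load: keep the checksum alongside the business columns.
target = {}  # primary key -> stored checksum (stand-in for the target table)
target[("42",)] = row_md5(["Alice", "Hamburg", "2013-07-01"])

# Later run: one hash per incoming row, one string comparison per row.
incoming = {("42",): ["Alice", "Hamburg", "2013-07-02"]}
for key, cols in incoming.items():
    if target.get(key) != row_md5(cols):
        print("changed:", key)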
-craig

"You can never have too many knives" -- Logan Nine Fingers
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

chulett wrote:...And your data volumes have nothing to do with the 'reliability' of this approach... just the speed....
This was the doubt I had. Since it's a 32-bit code, it would produce 2^32-1 values, right? Or is that understanding wrong?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Worse than that. It will produce at most 2^31-1 values, but only if your data provide that many distinct combinations. (You can improve this to 2^32 values using uint32, but the same caveat applies.)
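For the integer ranges behind those numbers, a quick arithmetic aside (about 32-bit integer types in general, not a statement about DataStage internals):

Code: Select all

# Value ranges of 32-bit integer types (plain arithmetic, not DataStage behaviour).
signed_max = 2**31 - 1    # largest positive value an int32 can hold
unsigned_values = 2**32   # distinct values available as a uint32
print(signed_max)         # 2147483647
print(unsigned_values)    # 4294967296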
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Thank you Ray, Craig and Roland. The information you've provided is sufficient for me to proceed with a POC and make a decision.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

8)

On the subject of "at most 2^32-1 values", I just wanted to chime in with the thought that (in my mind at least) it's not a concern for the application we're discussing. If you were trying to generate "unique" values like I've seen some people do in a misguided attempt to establish some kind of "intelligent" surrogate, then I'd be concerned. But you are basically just comparing X columns in one version of a single row with a new set of X columns for that same row to see if anything changed. I don't know what the odds are but I'd wager the odds of you losing a change because the new version of the row happened to produce the exact same checksum as the old version of that same row have to be... pretty darn small.
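Some back-of-the-envelope arithmetic for "pretty darn small", assuming a roughly uniform 32-bit checksum and taking the upper end of the volumes mentioned in this thread:

Code: Select all

# Rough odds of missing a change because the old and new versions of a row
# happen to produce the same 32-bit checksum (assumes a roughly uniform hash).
p_miss = 1 / 2**32                  # per genuinely changed row
changed_rows = 60_000_000           # upper end of the volumes in this thread
print(p_miss)                       # ~2.3e-10
print(changed_rows * p_miss)        # ~0.014 expected misses across 60M changed rows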
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Possibly, but with the Change Capture, Difference or Compare stages there's no scope for that at all. Which is better odds than "pretty darn small", imho.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sure, but that wasn't the topic of discussion. Besides, by "pretty darn small" I basically meant zero. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Jerome
If you are using the Checksum stage, then you have the luxury of dealing with huge data volumes.
The Checksum stage produces a 32-character code (datatype Char(32)).
So the number of possible combinations is 16^32 = 2^128 ≈ 3.4 x 10^38.
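A quick check of that arithmetic in Python (MD5 is used here only as an example of a hash rendered as 32 hexadecimal characters):

Code: Select all

# 32 hexadecimal characters give 16**32 possible values, which equals 2**128.
import hashlib

print(16**32 == 2**128)                           # True
print(16**32)                                     # 340282366920938463463374607431768211456
print(len(hashlib.md5(b"some row").hexdigest()))  # 32 hex characters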
Thanx and Regards,
ETL User
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Wow! For real?! Is it because the generated checksum is hexadecimal in nature?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Yup, you are right.
Check out the post below as well:

viewtopic.php?t=145383
Thanx and Regards,
ETL User