Credibility of Checksum and Alternatives
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
We're trying to determine whether the incremental data provided by our client's partners is indeed reliable and, if so, to what degree. To achieve this, we are performing the exercise below:
1. Day 1 - Take a full extract of the source from the partner for Day 1
2. Day 2 - Take a full extract of the source from the partner for Day 2 and also accept the incremental feed that they normally provide (i.e. Change data between day 1 and day 2)
3. Do a column-by-column comparison to check for any changes between the two full extracts we took on Day 1 and Day 2 and compare the output with the Incremental feed provided by the partner.
Now, the tables in question have volumes anywhere between 5 million and 60 million rows, and each table can have 40-50 columns. A column-by-column comparison does not seem like a very good idea, so I suggested using a checksum strategy.
My question: is CRC32 reliable for the above-mentioned range of volumes? Is there a better alternative to achieve what we're trying to do here?
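The three steps above can be sketched in miniature. This is only an illustration of the checksum idea, with made-up primary keys and column values; the real jobs would of course run inside DataStage rather than Python:

```python
import zlib

# Hypothetical rows keyed by primary key; the column values are invented.
day1 = {
    101: ("Alice", "NJ", "active"),
    102: ("Bob", "NY", "active"),
    103: ("Carol", "CA", "inactive"),
}
day2 = {
    101: ("Alice", "NJ", "inactive"),   # changed
    102: ("Bob", "NY", "active"),       # unchanged
    104: ("Dave", "TX", "active"),      # new
}

def row_crc(columns):
    """CRC32 over the concatenated non-key columns, with a delimiter
    so ("ab", "c") and ("a", "bc") don't collide trivially."""
    return zlib.crc32("|".join(columns).encode("utf-8"))

crc1 = {pk: row_crc(cols) for pk, cols in day1.items()}
crc2 = {pk: row_crc(cols) for pk, cols in day2.items()}

# Classify the differences; this output is what you would reconcile
# against the partner's incremental feed.
inserts = sorted(crc2.keys() - crc1.keys())
deletes = sorted(crc1.keys() - crc2.keys())
updates = sorted(pk for pk in crc1.keys() & crc2.keys() if crc1[pk] != crc2[pk])

print(inserts, deletes, updates)   # [104] [103] [101]
```

The key design point is that the primary key is kept out of the checksum: it is the join key, and only the remaining columns feed the hash.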
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Hi Jerome,
The important question is: what do you want to achieve by comparing the CRC32 of two data rows instead of comparing the column values themselves?
You'd still have to read all rows and all columns from both extracts and generate the checksum for both, so you do not reduce the amount of data that has to be read into the job. I do not doubt that CRC32 is sufficiently reliable, provided you keep the primary key of your data and generate the checksum over all other columns. But I doubt your job is going to run much faster.
DataStage is easily capable of handling 5 to 60 million rows in ChangeCapture when you're running it on an adequately sized Unix machine.
You say that you want to verify the reliability of your business partners. But you don't want to do that in a daily batch job, do you?
"It is not the lucky who are grateful.
It is the grateful who are happy." -- Francis Bacon
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
I thought that comparing a single Checksum column would be more efficient than comparing each of the 40-50 columns. This is my idea behind using Checksum.
To answer your second question: this is not going to be a daily batch job that goes live. It will be run every day for the next month to monitor the feeds and gather a large enough sample to reach a conclusion.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Assuming significant performance gains, is CRC32 reliable enough for the data volumes mentioned? Are there optimal alternatives to this approach?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
I wouldn't assume anything (especially anything 'significant') but rather do a quick test to see what kind of speed you get from the CRC32 function and one comparison vs. comparing all data columns. We're doing this using MD5 on sometimes a couple hundred columns (in another tool), mostly because it is easier than coding all of the checks and we don't have to worry about whether it takes 'a little longer' to do it this way. And your data volumes have nothing to do with the 'reliability' of this approach... just the speed. And the only way to find out what that speed is is to code up some large-volume tests.
If you do this for real, make sure you store the checksum in the target. That way you only ever have to compute it once per occurrence.
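A hedged sketch of the kind of quick test being suggested here, using synthetic 40-column rows with invented values (only timings on your real data and hardware matter). Note that this sketch recomputes the CRC on both sides, which is exactly why storing the checksum in the target pays off: a stored checksum turns the comparison into one integer check with no recomputation of the old side.

```python
import timeit
import zlib

# Synthetic 40-column rows; widths and repetition count are made up,
# so treat the printed timings as the shape of a test, not a benchmark result.
old_row = tuple(f"value_{i:04d}" for i in range(40))
new_row = old_row[:39] + ("changed",)

def compare_columns(a, b):
    # Tuple equality walks the 40 columns until the first mismatch.
    return a == b

def compare_crc(a, b):
    # Recomputes CRC32 on both sides -- the worst case for the checksum
    # approach; storing one side's checksum halves this work.
    return zlib.crc32("|".join(a).encode()) == zlib.crc32("|".join(b).encode())

t_cols = timeit.timeit(lambda: compare_columns(old_row, new_row), number=100_000)
t_crc = timeit.timeit(lambda: compare_crc(old_row, new_row), number=100_000)
print(f"column compare: {t_cols:.3f}s  crc compare: {t_crc:.3f}s")
```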
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
chulett wrote: ...And your data volumes have nothing to do with the 'reliability' of this approach... just the speed....
This was the doubt I had. Since it's a 32-bit code, it would produce 2^32-1 values, right? Or is that understanding wrong?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Worse than that. It will produce at most 2^31-1 values, but only if your data provide that many distinct combinations. (You can improve this to 2^32 values using uint32, but the same caveat applies.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Thank you Ray, Craig and Roland. The information provided by you is sufficient for me to proceed with a POC and take a decision.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
On the subject of "at most 2^32-1 values", I just wanted to chime in with the thought that (in my mind at least) it's not a concern for the application we're discussing. If you were trying to generate "unique" values like I've seen some people do in a misguided attempt to establish some kind of "intelligent" surrogate, then I'd be concerned. But you are basically just comparing X columns in one version of a single row with a new set of X columns for that same row to see if anything changed. I don't know what the odds are but I'd wager the odds of you losing a change because the new version of the row happened to produce the exact same checksum as the old version of that same row have to be... pretty darn small.
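For what it's worth, that wager can be made concrete with a back-of-envelope estimate, under the idealized assumption that CRC32 outputs are uniformly distributed over the 32-bit space:

```python
# A genuinely changed row evades detection only if its old and new versions
# hash to the same 32-bit value: roughly 1 in 2^32.
per_row = 1 / 2**32

# Over a month of daily runs on a 60-million-row table, even in the extreme
# case where every row changes every day, the expected number of missed
# changes stays well below one.
rows, days = 60_000_000, 30
expected_misses = rows * days * per_row

print(f"per-row miss odds: {per_row:.2e}")                       # 2.33e-10
print(f"expected misses over the month: {expected_misses:.3f}")  # 0.419
```

This matches the intuition in the post: for change detection on a per-row basis, CRC32 collisions are a rounding error; it's only when you need globally unique values (e.g. as a surrogate key) that the birthday problem bites.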
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Woww! For real?! Is it because the checksum generated is hexadecimal in nature?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.