CheckSum Stage

rbk
Participant
Posts: 23
Joined: Wed Oct 23, 2013 1:10 am
Location: India

CheckSum Stage

Post by rbk »

Dear all,
Hope everyone is doing well.
I need some information regarding the Checksum stage.
Background:
My scenario goes something like this:
1. A source file with over 280 columns and a target table with the same number of columns.
2. We need to perform a CDC operation to capture the delta.
3. We have over 45 such file/table combinations, with varying numbers of keys and column values.

I was wondering if we can create a generic job that can be reused. For that, I am exploring the Checksum stage to create checksum values, together with a concatenation of the keys, so that I can perform the comparisons easily.
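To make the idea concrete (outside DataStage, just a rough Python sketch with made-up column lists, and MD5 purely as an example digest; the real job would of course use the Checksum stage and a compare/Change Capture stage):

Code:

import hashlib

# Hypothetical metadata for one file/table pair; in the generic job this
# would come from a parameter set or schema file rather than being hard-coded.
KEY_COLUMNS = ["cust_id", "source_system"]
DATA_COLUMNS = ["name", "city", "credit_limit"]   # ... the remaining ~280 columns

def row_signature(row):
    """Return (concatenated key, digest of the non-key columns)."""
    key = "|".join(str(row[c]) for c in KEY_COLUMNS)
    payload = "|".join(str(row[c]) for c in DATA_COLUMNS)
    return key, hashlib.md5(payload.encode("utf-8")).hexdigest()

def classify(source_rows, target_digests):
    """target_digests maps concatenated key -> digest stored on the target."""
    for row in source_rows:
        key, digest = row_signature(row)
        if key not in target_digests:
            yield "insert", key
        elif target_digests[key] != digest:
            yield "update", key
        else:
            yield "copy", key

The delimiter matters: joining with a character that cannot occur in the data keeps values like ("1","23") and ("12","3") from producing the same string.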

Queries:
So, my queries on the Checksum stage:
1. The output is always VARCHAR(64) (Unicode). Can I depend on it to always be unique? I ask because we are dealing with master data, with counts exceeding millions, so what are my chances?

2. Is there any documentation I can read on the internal workings? I ask because I was wondering how the checksum would handle a scenario like the following.

Say a checksum is being generated for two columns, A and B.
Initial values: 1, 2

The next day, for the same key, the values change to 2, 1.
Would that generate the same checksum value, i.e. 1,2 against 2,1?

3. Is there any other care to be taken with the Checksum stage?

Thank you in advance.
Cheers,
RBK
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Without going into number theory, the quick answer to one of your questions is no. There is no algorithm that can take X bytes, map them into fewer bytes, and never produce a repeated value. There are many algorithms that get so close that the odds of a repeat are nearly zero, but they can never be zero; they can only be zero if you map to more bytes than the original input. If you fear a duplicate checksum, check the keys as well: the odds of two records having the same key and the same checksum are even smaller than for the checksum alone, and that is more than sufficient to protect you from a collision.

Millions is not impressive. 2 to the 64th power divided by 1 million is 18,446,744,073,709. A 64-bit value is enormous: you could store that many MILLIONS of records under a numeric key without a collision. To be extremely clear about how big that is, the 9 in the last position of that value alone represents 9 million records.
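To put numbers on that, here is a quick sanity check in Python, treating the checksum as if it were a uniform 64-bit value (an assumption for the estimate, not a statement about the stage's internals):

Code:

import math

BITS = 64                  # assumed digest width for this estimate
SPACE = 2 ** BITS          # number of distinct digest values

print(SPACE // 1_000_000)  # 18446744073709 -> how many millions of values fit in the space

# Birthday-problem approximation: the chance of at least one collision among
# n random digests is about 1 - exp(-n*(n-1) / (2*SPACE)).
def collision_probability(n):
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * SPACE))

for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, collision_probability(n))
# roughly 2.7e-08, 2.7e-06 and 2.7e-04 respectively

And, as noted above, only a collision between two versions of the same key actually matters for CDC, which makes the effective odds smaller still.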


2) Checksum is a well-known algorithm; try Google. The coefficients and the polynomial under the hood vary, but the technique is well documented. If you do not like what you see, try SHA instead. You can download working SHA code in C, compile it, and add it right into DataStage. I've done this, and it took all of 10 minutes.

3) No, it would not generate the same value. Checksums are designed to generate very different output for similar input: the checksums of abcd and abcf will be radically different values, by design.
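For illustration, using Python's hashlib (MD5 and SHA-256 here are stand-ins, not a claim about what the stage uses internally):

Code:

import hashlib

for text in ("abcd", "abcf", "1|2", "2|1"):
    md5 = hashlib.md5(text.encode()).hexdigest()
    sha = hashlib.sha256(text.encode()).hexdigest()
    print(text, md5, sha)

# "abcd" and "abcf" differ in one character, yet their digests share nothing
# recognisable. Likewise "1|2" and "2|1" give completely different values, so
# the swapped values from the earlier question come out different, as long as
# the columns are combined with a delimiter rather than blindly concatenated.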
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Oh, other care: you'll want to be very careful about which columns you use. If you checksum ALL the columns and one of them is "meaningless" server information, such as who last accessed the data (that is, a column whose changes you really do not care about), then you must not include it in the checksum. You can pick and choose which columns to checksum; the default, I think, is everything.

You can approach the problem in various ways: you can group columns and checksum a few at a time to figure out whether a column in that group changed, or you can do it across the whole record. It just depends on what you need for your table. For example, if the table is updated from 10 tables across 3 source systems, you could break it down into chunks, on the assumption that if one system changed another might not have, if that makes sense to do here.
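A rough sketch of that grouping idea (plain Python with hypothetical column groups; in a job it would presumably be one Checksum stage, and one comparison, per group):

Code:

import hashlib

# Hypothetical grouping by source system; audit columns such as
# "last_accessed_by" are simply left out of every group.
COLUMN_GROUPS = {
    "crm_cols":     ["name", "address", "phone"],
    "billing_cols": ["credit_limit", "balance"],
    "erp_cols":     ["plant", "route_code"],
}

def group_digests(row):
    """One digest per column group, so a change can be localised to a group."""
    return {
        group: hashlib.md5("|".join(str(row[c]) for c in cols).encode()).hexdigest()
        for group, cols in COLUMN_GROUPS.items()
    }

old = {"name": "A", "address": "X", "phone": "1",
       "credit_limit": "100", "balance": "5", "plant": "P1", "route_code": "R1"}
new = dict(old, balance="7")            # only a billing column changed

old_d, new_d = group_digests(old), group_digests(new)
print([g for g in COLUMN_GROUPS if old_d[g] != new_d[g]])   # ['billing_cols']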
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And without going into all of the normal lecture behind it, for #1 you do not need the value to be 'always unique'.
-craig

"You can never have too many knives" -- Logan Nine Fingers