Page 1 of 1

How to prevent CHECKSUM stage to re-arrange column names

Posted: Mon Jan 15, 2018 4:37 am
by rohit_mca2003
Hi Everyone,

I have issue while computing hash value using 'CHECKSUM' stage. It seems that CHECKSUM stage re-arranges the columns as per their names while computing the hash value.

Example:
----------
Source --> Col1, Col2, Test_Val

Case1: If I generate checksum keeping order as (Col1, Col2, Test_Val)
Case2: If I generate checksum keeping order as (Col1, Test_Val, Col2).

In both cases, I get same checksum value (but order of column I passed was different).

Is there any way I can restrict this stage to re-arrange the columns while computing the hash value.

Thanks.

Posted: Mon Jan 15, 2018 8:57 am
by chulett
Can't answer your question but I suspect the answer is 'no'. But I have to ask - does it really matter? What's important (IMHO) are the columns selected and the fact that they are used / ordered in a consistent manner. As someone who has had to clean up a mess made by developers hashing the same columns in a different order, I would gladly take that aspect of it out of their hands. :wink:

Posted: Mon Jan 15, 2018 10:10 pm
by rohit_mca2003
Thanks Craig.
I need to generate CHECKSUM based on the column order as there is requirement to have keys in specific order.
Existing application is running in production where hash was generated based on order of columns and to match with existing hash values we need to follow same order of keys/columns.

Posted: Tue Jan 16, 2018 8:15 am
by chulett
Best to open a support case then, the documentation doesn't seem to show that as an option. Out of curiosity, what 'existing application' was used to generate the checksum and what did it use to generate them? I've had 'issues' trying to match checksums from different systems, hence the question.

Posted: Thu Feb 08, 2018 5:11 am
by abhiramk
Before computing the checksum, the Checksum stage arranges the columns in the alphabetical order of column names (irrespective of the order they appear in your input column tab).

I have tried and tested a couple of possibilities and there seem to be no option to rearrange the column order in which checksum is calculated. It always takes the alphabetical order.

To add more, the checksum stage adds a pipe symbol at the end of each column and calculates HASH using MD5 algorithm.

For e.g, if you are using two columns COL1 and COL2 to compute HASH/Checksum, COL1 holds ALICE and COL2 holds BOB, checksum is computed on ALICE|BOB|

This "ALICE|BOB|" string doesn't change even if COL2 is your first column and COL1 is your second column. The columns are first arranged as COL1,COL2 and a pipe operator is inserted at the end of each column and checksum is computed using MD5 hash algorithm.

So, even if there is an option to rearrange the columns (which obviously is a speculation), the hash computed by your application will not match with the Datastage computed hash unless your application too adds a pipe symbol at the end of each column.

Posted: Mon Feb 12, 2018 10:05 am
by UCDI
can you DIY?

if you explicitly concat your data into a single column the checksum will be in the order you specify with any delimiters you specify (or not).

You can also toss the internal checksum and use an external one that is shared across your platforms, instead of trying to replicate datastage's checksum externally.

Posted: Fri Mar 02, 2018 3:07 am
by rohit_mca2003
chulett wrote:Best to open a support case then, the documentation doesn't seem to show that as an option. Out of curiosity, what 'existing application' was used to generate the checksum and what did it use to generate them? I've had 'issues' trying to match checksums from different systems, hence the question.
Earlier data was feeded from database so it is easy to concat the columns in SQL but the source has been changed to file and we can not read columns as concatenated.

Posted: Fri Mar 02, 2018 3:24 am
by rohit_mca2003
To answer everyone's query. we have resolved this issue and may be would be good use case for future.

I am aware that DataStage puts '|' after each field passed to 'checksum' operator and it uses HASH MD5.

I tried 2 solutions and both worked:

1. In generic stage, I wrote small code for 'transform' operator and concatenate the fields in order which I wanted. As my job is generic is nature so I used parameters for this. Then passed that concatenated column to 'checksum' operator.

2. We built logic to rename and prefix (using alphabet and numbers) all the source columns in such a way that even after re-arrangement of as per column names will be in same order.

Thanks for all the help. I am marking this as resolved with workaround.
Anyways I will raise a ticket for this.