DataStage MD5 Implementation

Posted: Fri Aug 11, 2017 12:52 pm
by wpakkala
Does anyone have an MD5 "Stage" for DataStage 11.5, or tips on how to create one?

Also, it would need to match the results of Perl's Digest::MD5.

Posted: Fri Aug 11, 2017 2:59 pm
by PaulVL
Why don't you make a BuildOp that calls your Perl MD5 code and use that?

Posted: Fri Aug 11, 2017 3:37 pm
by chulett
I would wager that would be the best (only?) way to ensure the results match.

Posted: Mon Aug 14, 2017 8:42 am
by PaulVL
Well, given that MD5 is an industry standard, any certified MD5 calculator should spit out the same result.

Introducing perl to the mix just to calculate that might be overkill.... depends how you would call it of course. If you spin up perl to calculate md5 for each row... that could be costly (up/down, up/down, up/down, etc...).

An external program to read in a file and append the MD5 value to each record... possible.

A Routine to add MD5 and put that in your transformer stage... possible.

Not sure if any databases out there have MD5 functions that can be called via stored procedure.

Posted: Mon Aug 14, 2017 11:08 am
by chulett
FWIW, Oracle does.

Posted: Mon Aug 14, 2017 8:54 pm
by Timato
FYI - the checksum stage spits out the MD5 of your fields.

The fields are concatenated, pipe delimited and appended with a trailing pipe.

Unfortunately not documented anywhere though.....
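
To reproduce that input outside DataStage, the string to hash could be built roughly as below. This is only a sketch that assumes the layout described above (each field followed by a pipe, including the last one). The name build_checksum_input is hypothetical, and null handling and type-to-string conversion are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes every field followed by '|' into out,
   so the result ends with a trailing pipe ("A|BC|").
   Returns the length written, or -1 if out is too small. */
int build_checksum_input(const char *fields[], size_t count,
                         char *out, size_t out_size) {
    size_t pos = 0;
    assert(out != NULL && out_size > 0);
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(fields[i]);
        /* need room for the field, its pipe, and the terminating NUL */
        if (len + 2 > out_size - pos) {
            return -1;
        }
        memcpy(out + pos, fields[i], len);
        pos += len;
        out[pos++] = '|';
    }
    out[pos] = '\0';
    return (int)pos;
}
```

The resulting buffer is what an external MD5 routine would then be given, so that both sides hash the same bytes.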

Posted: Tue Aug 15, 2017 9:33 am
by PaulVL
(I'm not a developer...)

So if he forks his data into the checksum stage, then how could he join it back to his main data and concatenate the checksum value to the correct data column?

Posted: Tue Aug 14, 2018 11:58 am
by asorrell
I know I'm resurrecting an old thread, but just encountered this, and it IS now documented.

The Checksum stage does use MD5, but unfortunately it changes the data being hashed without telling you, so the result won't match an externally generated hash value unless the external process also adds pipes to the data values in the appropriate places.

DataStage Checksum stage, how is the result computed?
http://www-01.ibm.com/support/docview.w ... wg22009454

Posted: Tue Aug 14, 2018 12:02 pm
by chulett
Well, who in their right mind doesn't add pipes in all the appropriate places? :wink:

Posted: Tue Aug 14, 2018 5:49 pm
by vmcburney
You do have to add the pipes to do a proper checksum. Without a separator between fields, your checksum could produce false positive matches. The main issue is that the stage also adds a pipe onto the end of the string you are performing the checksum on. Most manually coded MD5 functions will only add separators between fields and not an extra one on the end. You can't remove that last pipe.

Posted: Tue Aug 14, 2018 7:15 pm
by chulett
Interesting.

My classic example when teaching newbies about this keeps it simple. Imagine a record with two fields, with "A" in the first field and "BC" in the second. And then a change comes through with "AB" in the first and "C" in the second.

Without the pipes (or some separator):
1: ABC
2: ABC

The checksum would be identical.

Once more with feeling (and pipes):
1: A|BC
2: AB|C

And you're golden.
:D
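
The same point can be checked without computing any hash at all, since identical input always gives an identical checksum. A small sketch, with hypothetical helper names join_two and same_hash_input and fixed 64-byte buffers chosen only for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a + sep + b into out.
   Returns 0 on success, -1 if the result did not fit. */
int join_two(const char *a, const char *b, const char *sep,
             char *out, size_t out_size) {
    int n;
    assert(a != NULL && b != NULL && sep != NULL);
    n = snprintf(out, out_size, "%s%s%s", a, sep, b);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Returns 1 if both two-field records produce the same string to hash
   (so MD5 would report "no change"), 0 if they differ,
   and -1 if a record did not fit the buffer. */
int same_hash_input(const char *a1, const char *b1,
                    const char *a2, const char *b2, const char *sep) {
    char r1[64], r2[64];
    if (join_two(a1, b1, sep, r1, sizeof r1) != 0) return -1;
    if (join_two(a2, b2, sep, r2, sizeof r2) != 0) return -1;
    return strcmp(r1, r2) == 0;
}
```

Calling same_hash_input("A", "BC", "AB", "C", "") reports a match, while the same call with "|" as the separator does not.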

Posted: Wed Aug 15, 2018 4:47 am
by qt_ky
It's nice to know the calculation is documented!

Posted: Fri Aug 24, 2018 2:32 am
by ArndW
Sorry for jumping in a bit late on this thread - I've been offline. I've built MD5 operators a couple of times, and they are quite easy. One can either find public-domain C++ code on the net or use a small interlude program which calls the libcrypto MD5 algorithm.

#include <stdio.h>                                                   // Library containing "sprintf"                             //
#include <string.h>                                                  // Definitions for "strlen" and "strcpy"                    //
#include <openssl/md5.h>                                             // MD5 Definition                                           //
                                                                     //==========================================================//
char* md5(char* InString) {                                          // Method called from Datastage                             //
   unsigned char digest[MD5_DIGEST_LENGTH];                          // Binary function return value from md5                    //
   static char mdString[33];                                         // Function response string                                 //
   MD5_CTX ctx;                                                      // MD5 control structure definition                         //
   MD5_Init(&ctx);                                                   // MD5 control structure initialization                     //
   MD5_Update(&ctx, InString, strlen(InString));                     // Compute the MD5 value                                    //
   MD5_Final(digest, &ctx);                                          // Fill "digest" and "ctx" contents                         //
   for(int i = 0; i < MD5_DIGEST_LENGTH; i++) {                      // Loop to move result into character-array as 16 Hex values//
      sprintf(&mdString[i*2], "%02x", (unsigned int)digest[i]);      // Convert using "sprintf"                                  //
   } // end of for-next each character                               //                                                          //
   return (char*)mdString;                                           // Return computed md5 string                               //
} // end of method md5                                               //----------------------------------------------------------//

Compile this as a library from your OS and then create a Parallel routine definition as an "external function" where the library path points to the library object built above. It has one return value of type "char*" and one input parameter of the same type.