DataStage MD5 Implementation

wpakkala
Participant
Posts: 7
Joined: Sun Jul 17, 2011 7:13 am

Post by wpakkala »

Does anyone have an MD5 "Stage" for DataStage 11.5 or tips on how to create one?

Also, it would need to match the results of Perl's Digest::MD5.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Why don't you make a BuildOp that calls your Perl MD5 code and use that?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would wager that would be the best (only?) way to ensure the results match.
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, given that MD5 is an industry standard, any certified MD5 calculator should spit out the same result.

Introducing perl to the mix just to calculate that might be overkill.... depends how you would call it of course. If you spin up perl to calculate md5 for each row... that could be costly (up/down, up/down, up/down, etc...).

An external program to read in a file and concatenate the MD5 value... possible.

A Routine to add MD5 and put that in your transformer stage... possible.

Not sure if any databases out there have MD5 functions that can be called via stored procedure.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

FWIW, Oracle does.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Timato
Participant
Posts: 24
Joined: Tue Sep 30, 2014 10:51 pm

Post by Timato »

FYI - the checksum stage spits out the MD5 of your fields.

The fields are concatenated, pipe delimited and appended with a trailing pipe.

Unfortunately not documented anywhere though.....
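
To reproduce that input outside DataStage, the concatenation described above can be sketched roughly as follows. This is only an illustration, assuming every field has already been converted to a character string; the helper name `build_checksum_input` is hypothetical and is not part of any DataStage API. The resulting string is what you would then feed to an external MD5 routine.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not a DataStage API): writes field1|field2|...|fieldN|
// into out, i.e. pipe-delimited with a trailing pipe.
// Returns 0 on success, -1 if out is too small.
int build_checksum_input(const char *const fields[], size_t nfields,
                         char *out, size_t outsize)
{
    size_t used = 0;
    if (outsize == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < nfields; i++) {
        size_t len = strlen(fields[i]);
        // Need room for the field, its pipe, and the terminating NUL.
        if (used + len + 1 >= outsize) return -1;
        memcpy(out + used, fields[i], len);
        used += len;
        out[used++] = '|';   // a pipe follows every field, including the last
        out[used] = '\0';
    }
    return 0;
}
```

For two fields "A" and "BC" this produces the string A|BC| (note the trailing pipe), which is the text to hash rather than A|BC.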
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

(I'm not a developer...)

So if he forks his data into the checksum stage, then how could he join it back to his main data and concatenate the checksum value to the correct data column?
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

I know I'm resurrecting an old thread, but just encountered this, and it IS now documented.

The Checksum stage does use MD5, but unfortunately it changes the data being hashed without telling you, so the result won't match an externally generated hash value unless the external process also adds pipes to the data values in the appropriate places.

DataStage Checksum stage, how is the result computed?
http://www-01.ibm.com/support/docview.w ... wg22009454
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Well, who in their right mind doesn't add pipes in all the appropriate places? :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You do have to add the pipes to do a proper checksum. Without a separator between fields your checksum could get false positive matches. The main issue is that it adds the pipe onto the end of the string you are performing the checksum on. Most manually coded MD5 functions will only add separators between fields and not an extra one on the end. You can't remove that last pipe.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting.

My classic example when teaching newbies about this keeps it simple. Imagine a record with two fields, with "A" in the first field and "BC" in the second. And then a change comes through with "AB" in the first and "C" in the second.

Without the pipes (or some separator):
1: ABC
2: ABC

The checksum would be identical.

Once more with feeling (and pipes):
1: A|BC
2: AB|C

And you're golden.
:D
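
The same two-field example can be checked with a small sketch. The helper names `join_two` and `same_hash_input` are made up for illustration, and the code only compares the strings that would be hashed; it does not compute MD5 itself.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: joins two fields into out with sep between them.
// Passing "" as sep gives plain concatenation.
void join_two(const char *a, const char *b, const char *sep,
              char *out, size_t outsize)
{
    snprintf(out, outsize, "%s%s%s", a, sep, b);
}

// Returns 1 if both records yield the same hash input, else 0.
// Identical hash input means an identical checksum, so the change is missed.
int same_hash_input(const char *a1, const char *b1,
                    const char *a2, const char *b2, const char *sep)
{
    char r1[64], r2[64];
    join_two(a1, b1, sep, r1, sizeof r1);
    join_two(a2, b2, sep, r2, sizeof r2);
    return strcmp(r1, r2) == 0;
}
```

Calling `same_hash_input("A", "BC", "AB", "C", "")` returns 1 (both records collapse to ABC), while the same call with `"|"` as the separator returns 0. This only holds as long as the separator character cannot appear inside the field values themselves.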
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

It's nice to know the calculation is documented!
Choose a job you love, and you will never have to work a day in your life. - Confucius
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sorry for jumping in a bit late on this thread - I've been offline. I've built MD5 Operators a couple of times, they are quite easy. One can either find public-domain C++ code on the net or use a small interlude program which calls the libcrypto MD5 algorithm.

Code:

#include <stdio.h>        // sprintf
#include <string.h>       // strlen
#include <openssl/md5.h>  // MD5_CTX, MD5_Init, MD5_Update, MD5_Final, MD5_DIGEST_LENGTH

// Called from DataStage: returns the MD5 of InString as 32 lowercase hex characters.
// The result lives in a static buffer, so each call overwrites the previous result.
char* md5(char* InString) {
   unsigned char digest[MD5_DIGEST_LENGTH];            // 16-byte binary digest
   static char mdString[33];                           // 32 hex characters plus terminating NUL
   MD5_CTX ctx;                                        // MD5 working context
   MD5_Init(&ctx);                                     // Initialize the context
   MD5_Update(&ctx, InString, strlen(InString));       // Feed the input bytes into the hash
   MD5_Final(digest, &ctx);                            // Finish the hash and write it to "digest"
   for(int i = 0; i < MD5_DIGEST_LENGTH; i++) {        // Convert each digest byte to two hex characters
      sprintf(&mdString[i*2], "%02x", (unsigned int)digest[i]);
   } // end of for-next each byte
   return (char*)mdString;                             // Return the hex string
} // end of method md5
Compile this as a library from your OS and then create a Parallel routine definition as an "external function" where the library path points to the library object built above. It has one return value of type "char*" and one input parameter of the same type.