Column to column comparison of DB2 and Hive tables

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
nimcurry
Participant
Posts: 7
Joined: Mon Nov 02, 2015 10:42 am
Location: toronto

Column to column comparison of DB2 and Hive tables

Post by nimcurry »

I have 40 tables in DB2 which have been migrated to Hive in same exact structure as DB2. I need to compare column by column the values of each DB2 table with each Hive table. The column names for all 40 tables is not the same.

I have designed a parallel job which reads both tables, looks up on a key and combine columns from both tables and then checks for differences using a transformer. Any differences are captured in file.

I wonder if there is any way to can make this job generic considering that lookup key column and the columns to check for differences in transformers are different for all the tables?
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

You can.

One way is to make a shared container with N input columns and M key columns.

You can use a transformer or modify or whatnot to rename your actual metadata columns to the temporary names in the generic container, and after the exit of the container you can rename them back. Its a little tedious, but it works. If you do it via the modify stage, I think you can generate the statements into text files via automation to make this less painful.
FranklinE
Premium Member
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

An old Cobol programmer suggestion learned from an even older Cobol programmer...

If the two versions of data are identically formatted -- meaning there's no difference between DB2 and Hive -- you start by doing a bulk unload on each side. You can do a full-record comparison to identify rows with differences, then use the output from that as input to a DataStage process that completes the rest of the tasks you mention.

You must know the exact length of the row as written to the file. Your comparison is on that length.

You don't need column names until after the comparison output. You have flexibility at that point to make sure columns are mapped or traced according to the downstream requirements.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

FranklinE wrote:An old Cobol programmer suggestion learned from an even older Cobol programmer...
Thanks, needed a chuckle today. :D
-craig

"You can never have too many knives" -- Logan Nine Fingers
nimcurry
Participant
Posts: 7
Joined: Mon Nov 02, 2015 10:42 am
Location: toronto

Post by nimcurry »

I was able to create a job and do a comparison. I am using a paramterized filter stage to compare a decimal column. For eg, I used AVG from hive (which is decimal (30,11)) and AVG as AVG_DB2 from db2 table (which is also decimal (30,11)). My code in the filter stage was as follows:

Code: Select all

AVG <> AVG_DB2
However I get the following error when i run the job:
Filter_184: Error when checking operator: Caught parsing exception during wrapDescribeOperator(): Parsing parameters "" for schema type "decimal": Expected precision; got: <eof>
Wonder if any one has encountered this error before.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Perhaps an exact search for (at a minimum) "wrapDescribeOperator" here? Brought up quite a number of posts for me...
-craig

"You can never have too many knives" -- Logan Nine Fingers
nimcurry
Participant
Posts: 7
Joined: Mon Nov 02, 2015 10:42 am
Location: toronto

Post by nimcurry »

To explain this further, i tried with a string column with same condition:

Code: Select all

CISNO <> CISNO_DB2
and it worked. For that matter, a decimal column without an alias work as well:

Code: Select all

AVG <> AVG
But it actually compares to the same column. That being said, i tried searching for the error that i am getting, but could get any significant answers. ALso would like to mention that I am using RCP to populate this columns
Post Reply