Column to column comparison of DB2 and Hive tables

nimcurry · Post by **nimcurry** » Thu Jan 18, 2018 11:39 am

I have 40 tables in DB2 which have been migrated to Hive in same exact structure as DB2. I need to compare column by column the values of each DB2 table with each Hive table. The column names for all 40 tables is not the same.

I have designed a parallel job which reads both tables, looks up on a key and combine columns from both tables and then checks for differences using a transformer. Any differences are captured in file.

I wonder if there is any way to can make this job generic considering that lookup key column and the columns to check for differences in transformers are different for all the tables?

UCDI · Post by **UCDI** » Thu Jan 18, 2018 12:02 pm

You can.

One way is to make a shared container with N input columns and M key columns.

You can use a transformer or modify or whatnot to rename your actual metadata columns to the temporary names in the generic container, and after the exit of the container you can rename them back. Its a little tedious, but it works. If you do it via the modify stage, I think you can generate the statements into text files via automation to make this less painful.

FranklinE · Post by **FranklinE** » Fri Jan 19, 2018 1:12 pm

An old Cobol programmer suggestion learned from an even older Cobol programmer...

If the two versions of data are identically formatted -- meaning there's no difference between DB2 and Hive -- you start by doing a bulk unload on each side. You can do a full-record comparison to identify rows with differences, then use the output from that as input to a DataStage process that completes the rest of the tasks you mention.

You must know the exact length of the row as written to the file. Your comparison is on that length.

You don't need column names until after the comparison output. You have flexibility at that point to make sure columns are mapped or traced according to the downstream requirements.

chulett · Post by **chulett** » Fri Jan 19, 2018 1:34 pm

FranklinE wrote:An old Cobol programmer suggestion learned from an even older Cobol programmer...

Thanks, needed a chuckle today. :D

nimcurry · Post by **nimcurry** » Wed Jan 24, 2018 4:06 pm

I was able to create a job and do a comparison. I am using a paramterized filter stage to compare a decimal column. For eg, I used AVG from hive (which is decimal (30,11)) and AVG as AVG_DB2 from db2 table (which is also decimal (30,11)). My code in the filter stage was as follows:

Code: Select all

AVG <> AVG_DB2

However I get the following error when i run the job:

Filter_184: Error when checking operator: Caught parsing exception during wrapDescribeOperator(): Parsing parameters "" for schema type "decimal": Expected precision; got: <eof>

Wonder if any one has encountered this error before.

chulett · Post by **chulett** » Wed Jan 24, 2018 4:44 pm

Perhaps an exact search for (at a minimum) "wrapDescribeOperator" here? Brought up quite a number of posts for me...

nimcurry · Post by **nimcurry** » Wed Jan 24, 2018 5:10 pm

To explain this further, i tried with a string column with same condition:

Code: Select all

CISNO <> CISNO_DB2

and it worked. For that matter, a decimal column without an alias work as well:

Code: Select all

AVG <> AVG

But it actually compares to the same column. That being said, i tried searching for the error that i am getting, but could get any significant answers. ALso would like to mention that I am using RCP to populate this columns