
Runtime comparison of tables using TD connector stage?

Posted: Wed Jan 16, 2019 3:09 am
by mouthou
Hi All,

We have about 700-800 existing Teradata tables which need to be migrated to a different Teradata DB server, and we need to compare the new data set with the existing set as and when the DataStage jobs run against both Teradata DBs.

With no choice left on the DB side, as there is no database link etc. to use for a DIFF, I am interested in exploring options for a single DataStage job which does the DIFFERENCE logic. I thought I would check with the folks here for clarifications or related inputs. Any details on this would be appreciated.

- We can pass a runtime SQL file to the Teradata Connector stage for each table, but the major issue is the metadata. Is there an option to bypass the column definitions? (Otherwise this is as bad as creating 700-800 jobs for the comparison.)
- Alternatively, can a Teradata Connector stage be used with a runtime schema file for the metadata, using the RCP feature the way it is done with Sequential File stages? (A sketch of such a schema file is shown below.)
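
For illustration, a runtime schema file in the Orchestrate schema format (the format Sequential File stages accept when RCP is used) might look like the sketch below; the column names and types here are placeholders only, not the real table definitions:

    record
    (
      CUST_ID: int32;
      CUST_NAME: nullable string[max=100];
      BALANCE: nullable decimal[18,2];
      LOAD_TS: timestamp;
    )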

Posted: Wed Jan 16, 2019 5:39 am
by chulett
I don't have any specific thoughts to share this morning other than to note that RCP is not limited to sequential files. And if there's no database link and if that would help simplify this process, have one created. Temporarily, if nothing else.

Posted: Wed Jan 16, 2019 7:54 am
by PaulVL
Do you have to validate each row or does a "sampling" suffice?


You can extract from a table (based upon keys), sort the data, and then, instead of dumping all of that data to a sequential file, output a checksum value. Then just compare both checksums.
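
As a rough sketch of the checksum idea, assuming Teradata's HASHROW/HASHBUCKET functions are acceptable as the checksum mechanism (table and column names below are placeholders), the same query could be run on each server and the single-row results compared:

    SELECT COUNT(*) AS row_cnt,
           SUM(CAST(HASHBUCKET(HASHROW(col1, col2, col3)) AS DECIMAL(38,0))) AS tbl_checksum
    FROM   mydb.mytable;
    -- run against both servers; a differing row_cnt or tbl_checksum flags the table for a closer look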

Another option is to load system #2, then do the checksum extract via SQL from both systems and use DS to compare the sorted results.
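
A per-row version of that extract might look like the following hedged sketch (again, cust_id and the column list are placeholders for the table's key and data columns):

    SELECT cust_id,
           HASHROW(col1, col2, col3) AS row_checksum
    FROM   mydb.mytable
    ORDER BY cust_id;   -- sorted so the two result sets can be compared side by side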

Posted: Wed Jan 16, 2019 12:16 pm
by ray.wurlod
Do you really need DataStage for this? Does not Teradata have its own table transfer utilities?

Posted: Wed Jan 16, 2019 12:51 pm
by chulett
Yeah... wondering much the same as what has already been posted. It seems like there should be a native solution to this problem, but if there isn't, I'd be looking for much more detail on exactly what kind of "comparisons" you need to be doing here.

Posted: Wed Jan 16, 2019 9:43 pm
by mouthou
Thanks for everyone's comments. To answer a few questions from Ray and Chulett:

- Yes, I also expected a native solution within the DB, but nothing is known yet (it will be 2 different production DB servers, and the options for playing around are limited). The kind of comparison is table-to-table, to see if there is any difference in loading by the same migrated DataStage job on a different server.
- Yes, we have to validate each row for any slight difference, even if the table contains 2 million rows (so PaulVL's options won't suffice).
- As Ray commented, I also thought of coming up with something within the DB, but nothing is known yet. I thought of DataStage as the worst case, since the validation job can be run from anywhere pointing to any DB, and there is no need for explicit columns, assuming DataStage's RCP feature can be used.

Posted: Thu Jan 17, 2019 1:12 pm
by chulett
Still a little lost, but... is this a "one time" process or something you'll need to run repeatedly, say after daily loads against incremental data sets? I think it's the word "runtime" in your subject that's making me ask...

Posted: Fri Jan 18, 2019 3:47 am
by mouthou
I think runtime could apply to any scenario, be it a one-time run or incremental batch loads. By "runtime" I meant passing something like the table name when the validation job runs.

This is a temporary validation process which will be run every day on production for a month, until the migration is signed off for correctness.

Posted: Fri Jan 18, 2019 5:19 am
by chulett
So... parallel testing, yes? Meaning for some period of time, you'll be taking the same daily data and loading it into two different database instances by running the same suite of jobs twice, once for each target? Or have you migrated DataStage there as well? I just saw the "same migrated DataStage job on a different server" comment. And then afterwards, you need to somehow compare the results to ensure they are identical. Just want to make sure we're all on the same page with a clear set of requirements.

Regardless, I don't see what about Paul's options "won't suffice". Sure, the sampling perhaps, but nothing in what was posted after that was specific to sampling. High-level checks could be simple row counts that would reveal an obvious / gross difference. Other than that, you'll need something manageable to work with at the gory-details level. I would think that creating a checksum of the data elements in each record, identified by that table's unique key, and dumping that for each target would be a manageable way to go. Both sets could be brought together (using DataStage if you like), the keys joined and the checksums compared. Missing keys and/or checksum mismatches could then be further investigated.

This rather than trying to compare the raw records.
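
As a hedged sketch of that join-and-compare step, assuming both per-key checksum extracts have been landed somewhere they can be queried together as chk_old and chk_new (all names are placeholders; the same logic could equally be a Join or Change Capture stage in DataStage):

    SELECT COALESCE(o.cust_id, n.cust_id) AS cust_id,
           CASE
             WHEN o.cust_id IS NULL THEN 'MISSING IN OLD'
             WHEN n.cust_id IS NULL THEN 'MISSING IN NEW'
             ELSE 'CHECKSUM MISMATCH'
           END AS diff_type
    FROM   stage_db.chk_old o
    FULL OUTER JOIN stage_db.chk_new n
           ON o.cust_id = n.cust_id
    WHERE  o.cust_id IS NULL
       OR  n.cust_id IS NULL
       OR  o.row_checksum <> n.row_checksum;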