Populating parent and child tables in one job - possible?

jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Populating parent and child tables in one job - possible?

Post by jreddy »

I have been trying to analyse the pros and cons of implementing a single job that populates both parent and child tables (Oracle). The reasons for wanting to do it: 1) high volume of data, 2) the child table needs to be populated with just the parent id and some default values in its columns, and 3) if I put them in separate jobs, I cannot correctly identify the category of the child. Let me explain with a simple example.

The parent gets input from two totally different sources, and I need two separate jobs for them because they populate totally different columns, and I DON'T have a column in the parent that identifies the source. I generate a surrogate key in the job for the parent records. Now, all I need to populate in the child are its own surrogate key, its parent id, and a column derived from some logic involving the category.

Now, if I were to put them in separate jobs, then while populating the child I wouldn't know what the source was for a given parent id (from the parent table). So I need to have both in the same job, but with the commit frequency and parallelism involved, I fear that child records might be rejected if their corresponding parent ids were not yet committed at that instant.

The only alternative I can think of is to populate the child table in a separate job that uses the same input sources as the parent job and determines the parent id by joining back to the parent table. But that involves so much repetition.
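
A minimal sketch of that join-back approach, assuming hypothetical table and column names (src_stage, parent, child, nat_key, child_seq):

[code]
-- Hypothetical names throughout: the separate child job re-reads the
-- same staged source the parent job used, then joins back to the
-- already-loaded parent table to recover each row's parent id.
INSERT INTO child (child_id, parent_id, category_col)
SELECT child_seq.NEXTVAL,           -- the child's own surrogate key
       p.parent_id,                 -- parent id recovered by the join-back
       CASE WHEN s.category = 'A'   -- placeholder for the category logic
            THEN 'DEFAULT_A' ELSE 'DEFAULT_B' END
FROM   src_stage s
JOIN   parent    p ON p.nat_key = s.nat_key;
[/code]

The repetition is in the FROM clause: every source read the parent job performs has to be performed again here.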

So I am looking for a way to start populating the child table only after the parent table has finished loading, within the same job - or does that defeat the purpose of using a parallel job? Please throw in your suggestions.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

In a word, SANDBOX. You are missing the concept of using a sandbox to land, stage, cleanse, and do all surrogate assignment work. Once you are happy with the results, extract your inserts and updates from the sandbox and blast them into the target. You'll of course argue about performance, the extra work involved, blah blah blah. :lol: But the merits (the ability to bulk load, pre-assign keys, remove database lookups, and work with off-target, high-performance reference objects/methods) more than make up for the "perceived" performance issues.
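
A minimal sketch of that sandbox flow, assuming hypothetical staging tables (parent_stage, child_stage) and sandbox-owned sequences for the pre-assignment:

[code]
-- All surrogate assignment happens in the sandbox, off-target.
INSERT INTO parent_stage (parent_id, nat_key, col_a)
SELECT sandbox_parent_seq.NEXTVAL, s.nat_key, s.col_a
FROM   src_stage s;

-- Children pick up their parent's surrogate key inside the sandbox,
-- so nothing depends on what has or hasn't been committed in the target.
INSERT INTO child_stage (child_id, parent_id, category_col)
SELECT sandbox_child_seq.NEXTVAL, ps.parent_id, s.category
FROM   src_stage s
JOIN   parent_stage ps ON ps.nat_key = s.nat_key;

-- Once the results look right, blast both into the target in bulk
-- (APPEND requests a direct-path insert).
INSERT /*+ APPEND */ INTO parent (parent_id, col_a)
SELECT parent_id, col_a FROM parent_stage;

INSERT /*+ APPEND */ INTO child (child_id, parent_id, category_col)
SELECT child_id, parent_id, category_col FROM child_stage;
[/code]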
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

:) Well, actually we do have a pre-staging area, but that is for retrieving data from numerous and varied sources into Oracle tables, where it is transformed and cleansed. From there we populate the target database, which is highly normalised, unlike typical data warehouse projects. Hence the constraints. The task is more challenging because the target belongs to an application that uses Oracle sequences for ID generation on its tables, so while populating data we are supposed to use those same sequences. Hence all this surrogate key business while loading the target.
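
For context, a minimal sketch of drawing keys from the application's sequence during the load (sequence and table names here are hypothetical):

[code]
-- The web application and the load jobs share one sequence, so the
-- ids they generate can never collide during integration.
-- CREATE SEQUENCE app_parent_seq START WITH 1 INCREMENT BY 1 CACHE 100;

INSERT INTO parent (parent_id, nat_key, col_a)
SELECT app_parent_seq.NEXTVAL, s.nat_key, s.col_a
FROM   staged_clean s;
[/code]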
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

...and I forgot to add that the target database is going to be the backend for a web-based application, which is being built and tested simultaneously. Hence the decision to use Oracle sequences, so we don't cross each other's tracks in integration. We had our own sequences during development, but in integration we need to work together, right?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Ahhhhhhhhhh........splat

(sound of me jumping out the top window of the highest building in town)


Sucks to be you on this project. Your performance concerns are valid. Parents are processed first; children are processed second, with surrogate foreign key substitutions for the parents' natural keys. Doing all your parents first into the sandbox lets you get them finished; doing the children next lets you handle both and verify everything is okay. Only extract valid parents and their children for target loading. Now, if you have to defer surrogate key assignment, as well as substitution, until load time because of the Oracle sequence mandate, then your performance will be joining me flying out that window. You obviously can't then use bulk loading to simply blast in the parents and children extracted from the sandbox (gee, that is a pretty good design: data is validated and everything ahead of time, and bulk loaded).
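
A minimal sketch of that parents-first, children-second step in the sandbox, again with hypothetical names; the staged children carry the parent's natural key until the substitution pass resolves it:

[code]
-- Substitute each child's parent natural key with the parent's
-- surrogate key; children whose parent never arrived stay NULL.
UPDATE child_stage c
SET    c.parent_id = (SELECT p.parent_id
                      FROM   parent_stage p
                      WHERE  p.nat_key = c.parent_nat_key);

-- Extract only valid parents and the children that resolved to one.
INSERT INTO parent (parent_id, col_a)
SELECT parent_id, col_a FROM parent_stage;

INSERT INTO child (child_id, parent_id, category_col)
SELECT child_id, parent_id, category_col
FROM   child_stage
WHERE  parent_id IS NOT NULL;
[/code]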

Sorry, no insights for expediting your process.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

[quote]Sucks to be you on this project. [/quote]


8) Actually, I am totally loving this project because it poses so many challenges; I am learning a lot about using DataStage effectively. If it were like the regular data warehousing projects I worked on earlier, we would know the well-known do's and don'ts beforehand, tested and discussed, so it would be a piece of cake at least 80% of the time.