Hi all,
I need to create two files (file sets) for two lookups based on DBMS data.
I can either build two jobs, one to fill each file, or one job that fills both files.
What is the best practice? One job per unit of work, or maybe one job per task?
Thanks
division into jobs
My personal preference is generally to use separate jobs for the sake of restartability. When something goes wrong and some jobs complete but another aborts, it can be easier and faster to troubleshoot the problem in a simpler job, and only the aborted job needs to be restarted. There's no extra or repeat processing of logic that already completed successfully.
Choose a job you love, and you will never have to work a day in your life. - Confucius
+1
We always try to build jobs as atomic, restartable units of work for precisely the reasons mentioned above. And we wrap them in a "framework" that knows how to back out any partially completed loads (where applicable) so that restarts can be as "hands off" as possible.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Well... a girl's got to keep some secrets. But high level:
It's basically a set of tables that record which jobs have run and assign a unique id to each run of each job. All records inserted or updated by the run are tagged with that number. We also have control tables that document which tables each job targets and what 'rollback mechanism' to use for each. When a failed job is restarted, a stored procedure is called that looks up the mechanism and the id of the failed run and resets the table back to its pre-run condition. For example, type 2 updates have their new record deleted and the previous entry set back to 'current'.
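To make that a bit more concrete, here's a rough sketch of what such control tables and a type 2 rollback could look like. Every table, column and value name below is made up for illustration; it's not our actual schema, just the general shape of the idea:

-- Hypothetical run log: one row per execution of a job
CREATE TABLE etl_run_log (
    run_id      INTEGER      NOT NULL PRIMARY KEY,
    job_name    VARCHAR(100) NOT NULL,
    started_at  TIMESTAMP    NOT NULL,
    finished_at TIMESTAMP,
    status      VARCHAR(10)  NOT NULL    -- e.g. RUNNING, COMPLETE, FAILED
);

-- Hypothetical registry of which tables each job loads and how to back them out
CREATE TABLE etl_job_target (
    job_name           VARCHAR(100) NOT NULL,
    target_table       VARCHAR(128) NOT NULL,
    rollback_mechanism VARCHAR(20)  NOT NULL  -- e.g. DELETE_BY_RUN, TYPE2_RESET
);

-- Type 2 rollback for one target: remove the rows the failed run inserted,
-- then flip the versions it expired back to 'current'.
-- (Assumes each row carries the id of the run that created it and,
--  once closed, the id of the run that expired it.)
DELETE FROM dim_example
 WHERE created_by_run = :failed_run_id;

UPDATE dim_example
   SET current_flag = 'Y',
       effective_end_date = NULL
 WHERE expired_by_run = :failed_run_id;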
All jobs have these pipelines incorporated into them (something we call their job control framework), set to run in this order:
1. Check for and perform rollback if needed
2. Initialize a new run in the control tables
3. <actual work goes here>
4. Finalize the run in the control tables
Note that we're currently using Informatica for this, which has a "Target Load Plan" setting where you specify the order the pipelines run in, one after the other, for any given mapping. It's been long enough that I'm not quite sure how you would accomplish something equivalent in DataStage. I'd be curious whether others are doing something similar.
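For what it's worth, steps 1, 2 and 4 mostly boil down to a handful of statements against those same (hypothetical) control tables. A rough sketch, using the made-up names from above; in DataStage you could presumably fire these from before/after-job subroutines or leading/trailing stages, but that part is a guess:

-- Step 1: see whether the previous run of this job failed and needs backing out
SELECT run_id
  FROM etl_run_log
 WHERE job_name = :job_name
   AND status   = 'FAILED';
-- (if a row comes back, call the rollback procedure described above for that run_id)

-- Step 2: register the new run; everything this run writes gets tagged with :new_run_id
INSERT INTO etl_run_log (run_id, job_name, started_at, status)
VALUES (:new_run_id, :job_name, CURRENT_TIMESTAMP, 'RUNNING');

-- Step 4: finalize the run once the real work (step 3) has finished
UPDATE etl_run_log
   SET status      = 'COMPLETE',
       finished_at = CURRENT_TIMESTAMP
 WHERE run_id = :new_run_id;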
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers