How to Stop jobs on a grid

bobyon
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

Are there any special considerations when attempting to stop jobs on a grid?

I've seen several posts on here regarding how to stop and how not to stop DataStage jobs. And, I know kill -9 falls in the center of how NOT to stop a job.

What I have not seen is anything that mentions if there are special steps to take or precautions to consider when stopping a grid-enabled job. Especially one that may not respond to the stop button in Director.

Thanks,
Bob
daignault
Premium Member
Posts: 165
Joined: Tue Mar 30, 2004 2:44 pm

Post by daignault »

Usually just the stop button on the director will stop the job. We do go on a weekly basis looking for osh or osh.exe jobs on all compute nodes and kill lingering processes.

Also, we make sure we purge the scratch directory. Many times the tsort will leave a bunch of files on the system.
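For anyone wanting to script that weekly sweep, here is a minimal sketch, not our actual script. It assumes a POSIX shell with a procps-style ps on each compute node; the function names, the user name and the scratch path are made up for illustration. It only lists candidates, so you can review them before killing or deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: list processes owned by the given user whose command
# name is osh or osh.exe. Columns: PID, PPID, elapsed time, command name.
list_osh() {
    user="$1"
    ps -u "$user" -o pid= -o ppid= -o etime= -o comm= |
        awk '$4 == "osh" || $4 == "osh.exe"'
}

# Hypothetical helper: list regular files under a scratch directory that
# have not been modified for more than the given number of days.
list_old_scratch() {
    dir="$1"
    days="$2"
    find "$dir" -type f -mtime +"$days" -print
}
```

Something like `list_osh dsadm` and `list_old_scratch /scratch 7` would be run on every compute node (dsadm and /scratch are placeholders for your own execution userid and scratch location).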

Thanks in advance

Ray D
bobyon
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

Ray,

Thanks for the reply. I agree the stop button in Director should normally work; but in those rare cases, like the ones that sometimes happen at small retail companies such as the one you work for, I'm sure there are times when what normally works, or should work... simply doesn't.

In one of those cases, what steps should be taken to stop the job?

I know the first step, like we said, is to issue the stop from Director, then I'm expecting some effort to clean up resources and finally, when all else fails, issue the kill command(s). But specifically, what are the right steps between the stop in Director and the kill at the command line?

And, which of these need to be done on the head node(s) vs the compute nodes?
Bob
daignault
Premium Member
Posts: 165
Joined: Tue Mar 30, 2004 2:44 pm

Post by daignault »

I very rarely have problems with the stop on director..... maybe about 5 times.

Grep the process list for the job name. You will see the shell executing, then the su to your execution userid, then the dsjob command executing.

I usually do the following:

kill -15 <PID of the bin/sh process>

So that should be close to the top of the tree. Make sure you keep track of the ultimate parent of the job. In my case, I grep the ppid of each process until I find the top.

The reason I use kill -15 is that it sends SIGTERM, a software termination signal that the process is able to catch (strictly speaking, SIGINT, the software interrupt, is signal 2). DataStage will look for this signal and wrap up the processes in a normal manner. If you look at your log entries when you hit the "stop" button on the director, you will see an entry for SIGINT terminating the process.
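The "grep the ppid until I find the top" step can be sketched roughly like this. It is an illustration only, assuming a POSIX shell and a ps that accepts -o and -p; the function names ancestors and show_chain are made up. It prints the chain and sends no signal, so you still decide by eye which process is the right parent.

```shell
#!/bin/sh
# Print the ancestor chain of a PID, one PID per line, starting with the
# PID itself and stopping before PID 1 (init/systemd).
ancestors() {
    pid="$1"
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        echo "$pid"
        # "ppid=" suppresses the header; tr strips the padding spaces.
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    done
}

# Show PID, PPID, user and full command line for every process in the
# chain, so the parent can be picked by eye before any signal is sent.
show_chain() {
    for p in $(ancestors "$1"); do
        ps -o pid= -o ppid= -o user= -o args= -p "$p"
    done
}
```

Run `show_chain <pid>` with one of the PIDs you found by grepping for the job name, pick the shell at the top of the job's own tree (not a login daemon or anything shared), and only then send it kill -15.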

Hope this helps

Ray D
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Bob,
"Especially one that may not respond to the stop button in Director". You should use Resource Manager's web console to stop the job.

Ray,
"looking for osh or osh.exe jobs on all compute nodes and kill lingering processes." Resource Manager software (PBS Pro) provides the function to do that. Think about it: what if you have hundreds of compute nodes?
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, if you knew that you killed the job, then you know which server it got dispatched to.

Optionally, since the orphaned osh code tends to happen at small retailers such as the one Ray works for... you could automate the task and find old osh code that is older than X amount of days and simply kill those pids too.
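A rough sketch of that automation, under a few assumptions: a POSIX shell, a procps-style ps that reports elapsed time as [[dd-]hh:]mm:ss, and the orphaned processes showing up with the command name osh. The function names are made up, and "older than X days" is read here as "has been running for at least X full days".

```shell
#!/bin/sh
# Extract the whole-day part of a ps etime value ([[dd-]hh:]mm:ss).
# Values without a "dd-" prefix have been running for less than a day.
etime_days() {
    case "$1" in
        *-*) echo "${1%%-*}" ;;
        *)   echo 0 ;;
    esac
}

# Print the PIDs of osh processes that have been running for at least
# the given number of days. Nothing is killed here.
old_osh_pids() {
    min_days="$1"
    ps -e -o pid= -o etime= -o comm= | while read -r pid etime comm; do
        [ "$comm" = "osh" ] || continue
        if [ "$(etime_days "$etime")" -ge "$min_days" ]; then
            echo "$pid"
        fi
    done
}
```

Check the output of `old_osh_pids 2` first. Once you trust the list, something like `for p in $(old_osh_pids 2); do kill -15 "$p"; done` would send each one the same polite signal discussed above.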