How to Stop jobs on a grid

Post by **bobyon** » Wed Dec 19, 2012 2:05 pm

Are there any special considerations when attempting to stop jobs on a grid?

I've seen several posts on here regarding how to stop and how not to stop DataStage jobs. And I know kill -9 falls squarely in the category of how NOT to stop a job.

What I have not seen is anything that mentions whether there are special steps to take or precautions to consider when stopping a grid-enabled job, especially one that may not respond to the Stop button in Director.

Thanks,
Bob

Post by **daignault** » Wed Dec 19, 2012 2:11 pm

Usually just the Stop button in Director will stop the job. On a weekly basis we do look for osh or osh.exe processes on all compute nodes and kill any that are lingering.

Also, we make sure we purge the scratch directory. Many times the tsort will leave a bunch of files on the system.
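As a rough illustration of the scratch cleanup, here is a minimal Python sketch that only lists files older than a given age so they can be reviewed before anything is deleted. The function name `stale_files`, the age threshold, and the example path in the comment are all assumptions made for illustration; none of them come from this thread or from DataStage itself.

```python
import os
import time


def stale_files(root, max_age_days, now=None):
    """List files under `root` not modified in the last `max_age_days` days.

    This is a dry run: nothing is deleted. `root` would be whatever scratch
    directory your configuration file points at (hypothetical example:
    /scratch/ds).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged by their own timestamp
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except FileNotFoundError:
                continue  # file vanished while we were walking
    return sorted(stale)
```

Checking the age first matters because a running job may still be using files in the same directory; only files left over from jobs that finished long ago are safe candidates.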

Thanks in advance

Ray D

Post by **bobyon** » Wed Dec 19, 2012 2:41 pm

Ray,

Thanks for the reply. I agree the Stop button in Director should normally work, but in those rare cases, like the ones that sometimes happen at small retail companies like the one you work for, I'm sure there are times when what should work simply doesn't.

In one of those cases, what steps should be taken to stop the job?

I know the first step, as we said, is to issue the stop from Director. After that I'm expecting some effort to clean up resources and finally, when all else fails, the kill command(s). But specifically, what are the right steps between the stop in Director and the kill at the command line?

And, which of these need to be done on the head node(s) vs the compute nodes?

Post by **daignault** » Wed Dec 19, 2012 2:55 pm

I very rarely have problems with the stop in Director... maybe about 5 times.

Grep the process list for the job name. You will see the shell executing, then the su to your execution user ID, then the dsjob command executing.

I usually do the following:

kill -15 <PID of the bin/sh process>

So that should be close to the top of the tree. Make sure you keep track of the ultimate parent of the job. In my case, I grep the ppid of each process until I find the top.
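The PPID walk described here can be sketched in Python. This is only an illustration, assuming the process list comes from `ps -eo pid,ppid,args`; the helper names `parse_ps` and `ancestor_chain` and the sample process list are invented for the example and are not real output from a DataStage server.

```python
def parse_ps(text):
    """Parse `ps -eo pid,ppid,args` output into {pid: (ppid, args)}."""
    table = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header line and anything malformed
        args = parts[2] if len(parts) > 2 else ""
        table[int(parts[0])] = (int(parts[1]), args)
    return table


def ancestor_chain(pid, table):
    """Return [(pid, args), ...] from `pid` up to its last ancestor below PID 1."""
    chain = []
    seen = set()
    while pid in table and pid > 1 and pid not in seen:
        seen.add(pid)
        ppid, args = table[pid]
        chain.append((pid, args))
        pid = ppid
    return chain


# Invented sample data: shell -> su -> dsjob, as described in the post.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 init
 2100     1 /bin/sh run_job.sh MyJob
 2101  2100 su - dsadm -c dsjob -run proj MyJob
 2102  2101 dsjob -run proj MyJob
"""

chain = ancestor_chain(2102, parse_ps(SAMPLE))
top_pid, top_args = chain[-1]  # the shell at the top of this job's tree
```

The chain is returned in full rather than just the top entry, so you can check each ancestor's command line and confirm it belongs to the job before signalling anything.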

The reason I use kill -15 is that it sends SIGTERM, a termination request the process can catch (SIGINT, the software interrupt, is signal 2). DataStage will look for this signal and wrap up the processes in a normal manner. If you look at your log entries when you hit the Stop button in Director, you will see an entry for SIGINT terminating the process.
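For reference, signal 15 is SIGTERM on Linux and most Unix systems, while SIGINT is signal 2. Below is a minimal Python sketch of "ask politely, then check": it sends SIGTERM and waits for the process to go away, leaving any escalation to the operator. The names `pid_alive` and `request_stop`, the timeout values, and the injectable `send`/`alive`/`sleep` hooks are assumptions added for illustration and testing, not part of any DataStage tooling.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def request_stop(pid, timeout=30.0, poll=1.0,
                 send=os.kill, alive=pid_alive, sleep=time.sleep):
    """Send SIGTERM (15) and wait up to `timeout` seconds for the process to exit.

    Returns True if the process is gone, False if it is still running.
    Escalating to SIGKILL is deliberately left to the operator.
    """
    try:
        send(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone
    waited = 0.0
    while waited < timeout:
        if not alive(pid):
            return True
        sleep(poll)
        waited += poll
    return not alive(pid)
```

A `False` return is the point at which to look at what is still holding the process up, before reaching for anything stronger.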

Hope this helps

Ray D

Post by **lstsaur** » Wed Dec 19, 2012 7:32 pm

Bob,
"Especially one that may not respond to the stop button in Director". You should use Resource Manager's web console to stop the job.

Ray,
"looking for osh or osh.exe jobs on all compute nodes and kill lingering processes." The Resource Manager software (PBS Pro) provides a function to do that. Think about it: what if you have hundreds of compute nodes?

Post by **PaulVL** » Tue Dec 25, 2012 1:20 pm

Well, if you knew that you killed the job, then you know which server it got dispatched to.

Optionally, since orphaned osh processes tend to happen at small retailers such as the one Ray works for... you could automate the task: find osh processes that are older than X days and simply kill those PIDs too.
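One way to sketch the "older than X days" filter in Python is shown below. It assumes the process list comes from `ps -eo pid,etimes,comm` as provided by procps on Linux, where `etimes` is elapsed time in seconds; that column may not exist on other platforms. The function name `stale_osh_pids` and the sample process list are invented for illustration. The sketch only returns candidate PIDs and does not kill anything.

```python
def stale_osh_pids(ps_text, max_age_days):
    """Return PIDs of osh / osh.exe processes older than `max_age_days`.

    Expects text in the shape of `ps -eo pid,etimes,comm` output
    (assumed to be procps on Linux).
    """
    limit = max_age_days * 86400
    pids = []
    for line in ps_text.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header line and anything malformed
        pid, elapsed, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        if comm in ("osh", "osh.exe") and elapsed > limit:
            pids.append(pid)
    return pids


# Invented sample data for illustration only.
SAMPLE_PS = """\
  PID ELAPSED COMMAND
 3001  864000 osh
 3002     120 osh
 3003  900000 sshd
"""

candidates = stale_osh_pids(SAMPLE_PS, 7)
```

Run on each compute node, the resulting list can be reviewed and then passed to whatever stop procedure you have settled on.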