
Issues Aborting Jobs

Posted: Thu Dec 21, 2017 9:12 am
by jackson.eyton
Hi Guys,
I've been running into some issues lately with debugging jobs. Sometimes I will use the debug breakpoints
and review the data at that point, then stop the job to adjust for whatever I've seen there.

Recently I've been having an issue where a running job will not stop when I tell it to. This happens
both in debug mode and when just running a job normally. The log shows that the SIGINT, SIGTERM,
and SIGKILL signals are sent to the process, but it never ends. The job does stop processing data, but it
stays in a running state.
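
For readers who have not seen this pattern before, the SIGINT, then SIGTERM, then SIGKILL sequence in the log is a common way to stop a process with increasing force. Below is a minimal Python sketch of that escalation, assuming a POSIX host and a child process started by the script itself. It is not how DataStage implements job stops; the function name `stop_with_escalation` and the `grace_seconds` parameter are hypothetical and exist only for illustration.

```python
import signal
import subprocess
import sys
import time
from typing import Optional

# Escalation order mentioned in the job log: polite interrupt first, SIGKILL last.
# Assumption: POSIX host (signal.SIGKILL does not exist on Windows).
ESCALATION = (signal.SIGINT, signal.SIGTERM, signal.SIGKILL)


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float = 2.0) -> Optional[str]:
    """Send each signal in turn and wait up to grace_seconds for the child to exit.

    Returns the name of the signal after which the child exited,
    or None if it was still running after all three signals.
    """
    for sig in ESCALATION:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=grace_seconds)
            return sig.name
        except subprocess.TimeoutExpired:
            # Still running: fall through to the next, stronger signal.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical stand-in for a long-running job.
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    time.sleep(0.5)  # give the child a moment to start
    outcome = stop_with_escalation(job)
    print("exited after:", outcome)
```

The situation described in this post corresponds to the `None` case: all three signals were sent and the process was still there afterwards.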

I've left a job in this state overnight and it was still like this. I CAN get it to finally die if I use the
Cleanup Resources option in the Director, then log off the process associated with the job.

Here is where the REALLY annoying issue comes into play. After I have done this cleanup, 50% of the time,
the job will fail to run ever again until the Engine server is rebooted. The following log is one such job
that I am having this issue with currently. I've opened a case with IBM on it but so far they've not responded
in two days.

The Job log can be found here:
https://raw.githubusercontent.com/jacks ... rorlog.txt

Posted: Fri Dec 22, 2017 9:51 am
by asorrell
These kinds of issues are usually quite difficult to diagnose. Have you checked for project corruption with SyncProject recently?

The reason you are having to reboot the engine to restore job operation is that some resource tied to the job, like a semaphore
or lock, is not being cleared by the Cleanup Resources option. Rebooting is clearing that, and the job runs again.

I don't think I'd worry about figuring out what resource is being tied up, that's really a symptom. The problem is really the
job hanging when you attempt to stop it.

There's no way for you to diagnose that kind of issue without customer service. What they are going to need you to do is run a
stack trace against a hung job. That will tell them the internal routine currently being run by the job. At that point
they'll have to contact engineering and get them to say what is being executed by that routine.

If you haven't already, I'd suggest running an ISALite on your server and attaching that, along with the Version.xml files, the job dsx,
and the job logs, to your IBM ticket. Engineering won't even look at your problem unless they have an ISALite output to tell them
the server is in good working order. They'll need the Version.xml to know exactly what code base to look at internally during diagnosis.

Side note - the weird pagination on your post is caused by the PATH statements with no spaces, in a "code" block.
The browser doesn't want to mess with the code, even to insert line breaks, so the window gets VERY wide...

Posted: Fri Dec 22, 2017 11:45 am
by chulett
Yeah, I usually go in and nuke those lines from the post since they don't really add any value. Just wasn't in the mood when I had the time and didn't have the time when I was in the mood. :wink:

Posted: Fri Dec 22, 2017 3:13 pm
by jackson.eyton
Thank you both. I have edited my original post so that the job log is in a better-suited location. I usually think of that, and I can't really remember what I was or wasn't thinking for this post, so my apologies there.

I have run the health check for IBM and will continue to work with them. I am not familiar with SyncProject, however.