job control process (pid xxxx) has failed

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

Post by wuruima »

I got the warning message "job control process (pid xxxx) has failed" and then the job aborted. After searching the IBM support site, I found this:

Problem (Abstract)

Sequence job control process (pid xxxx) has failed

Cause


A Sequence job that runs continuously in a loop appends to the dsenv settings after each run, causing the length of your LD_LIBRARY_PATH (Sun/Linux), LIBPATH (AIX), or LIB_PATH (HPUX) environment variable to exceed 8192 bytes.

Diagnosing the problem

If, after following the steps in Technote http://www-01.ibm.com/support/docview.w ... wg21397247, the issue persists and you are running a Sequence job continuously in a loop, the next step is to check the length of your LD_LIBRARY_PATH (Sun/Linux), LIBPATH (AIX), or LIB_PATH (HPUX) environment variable and ensure that the string has NOT exceeded 8192 bytes.

If it has, then the likely cause is that the dsenv is being sourced continuously in a loop as well.
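The length check described above can be sketched in a few lines of Python. This is only an illustration, not part of the technote: the function name `check_library_paths` and the repeated-entry count are assumptions added here, and the script only reports something meaningful when it runs in the same environment as the failing job (that is, after dsenv has been sourced the same way).

```python
import os

# Limit quoted in the technote above.
MAX_PATH_BYTES = 8192

# Variable names exactly as listed in the technote, one per platform.
LIBRARY_PATH_VARS = ("LD_LIBRARY_PATH", "LIBPATH", "LIB_PATH")


def check_library_paths(environ=None, limit=MAX_PATH_BYTES):
    """Return {name: (length_in_bytes, repeated_entries, over_limit)}.

    Only variables that are actually set appear in the result.
    """
    env = os.environ if environ is None else environ
    report = {}
    for name in LIBRARY_PATH_VARS:
        value = env.get(name)
        if value is None:
            continue
        entries = [e for e in value.split(":") if e]
        # Entries beyond the first copy of each directory are repeats.
        repeated = len(entries) - len(set(entries))
        length = len(value.encode("utf-8"))
        report[name] = (length, repeated, length > limit)
    return report


if __name__ == "__main__":
    for name, (length, repeated, over) in check_library_paths().items():
        status = "OVER LIMIT" if over else "ok"
        print(f"{name}: {length} bytes, {repeated} repeated entries [{status}]")
```

A large number of repeated entries would be consistent with dsenv being sourced again and again, although it does not prove that this is the cause.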

Resolving the problem

Set the environment settings outside the loop, or set absolute strings (such as "LD_LIBRARY_PATH=<all-paths>") without appending :$LD_LIBRARY_PATH, which can cause the path settings to be repeated across multiple runs and eventually cause the crash.
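To see why the appended form grows while the absolute form does not, here is a small Python simulation. It is a sketch under assumptions: the two `source_dsenv_*` functions are hypothetical stand-ins for what a shell assignment in dsenv would do on each pass of the loop, and the directories are made up for illustration.

```python
def source_dsenv_appending(current, dsenv_paths):
    """Mimic `LD_LIBRARY_PATH=<paths>:$LD_LIBRARY_PATH` being sourced once."""
    return dsenv_paths + ":" + current if current else dsenv_paths


def source_dsenv_absolute(current, dsenv_paths):
    """Mimic `LD_LIBRARY_PATH=<all-paths>`: the previous value is discarded."""
    return dsenv_paths


def runs_until_limit(update, dsenv_paths, limit=8192, max_runs=10000):
    """Count how many applications of `update` it takes to exceed `limit` bytes.

    Returns None if the limit is not exceeded within `max_runs` iterations.
    """
    value = ""
    for run in range(1, max_runs + 1):
        value = update(value, dsenv_paths)
        if len(value.encode("utf-8")) > limit:
            return run
    return None


if __name__ == "__main__":
    # Hypothetical directories, for illustration only.
    paths = "/opt/example/engine/lib:/opt/example/odbc/lib"
    print("appending:", runs_until_limit(source_dsenv_appending, paths))
    print("absolute: ", runs_until_limit(source_dsenv_absolute, paths))
```

With the appending form the run count is finite and depends only on how long the path list is; with the absolute form the value never changes, so the limit is never reached.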
wuruimao

Re: how to understand this error

Post by wuruima »

I designed a sequence job that contains only a routine. In the routine, I first trigger jobs A, B, C, and D to run one by one (using a for loop).
Then a second for loop, from 1 to 9, submits jobs index1...index9 to run in parallel.

This is the log at the point where it aborted.

[info]a..JobControl (DSRunJob): Waiting for job index1 to start
[warn]Job control process (pid 28967492) has failed

Re: how to understand this error

Post by wuruima »

I simply reran the job without any changes. Now it is processing jobs 1-9 with no error.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... did the sequence job itself have the PID failure, or did one of the jobs it attempted to run have it? For the latter, is there anything in that job's log?

For an intermittent error like this, something you can't reproduce, in your shoes I would involve support.
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by chulett »

However, I will say that in my experience when you see something like this:

A fails. Sometime later with no changes or intervention, A runs fine.

This is usually resource related. As in a lack thereof.

Post by wuruima »

Yes, recently the DS environment has sometimes run out of resources.

Those error messages usually contain words like "resource"; the error message above, however, is not easy to understand.
Post by wuruima »

[info]a..JobControl (DSRunJob): Waiting for job index1 to start
[warn]Job control process (pid 28967492) has failed

After this log entry there is nothing special; it only shows that the sequence job aborted.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

We actually dislike this kind of support ticket. "It failed, and then worked again, do our work for us!"

Get a consultant to help diagnose the system issue if you do not have an appropriate resource who is skilled enough to evaluate your server. Do not lean on IBM Support without specific details, such as "Why is running x, y, z producing action a, b, c on this server?"

Tickets that complain that it failed then worked, with no further investigation done, will most likely require specific consulting assistance. It is your server, which is so unlike most of our other customers' servers, with different settings, configurations, and software installed. We need you to investigate how you set it up, and find out what is going on at the system level, before we can help explain the why.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

My money (2 cents) is on the ulimit of the user ID running the job. Check the nofile value. I am guessing it is the default 1024, which is way too low for an ETL environment.
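One way to look at the nofile value from inside a process is Python's standard `resource` module (Unix only). This is a minimal sketch, not a DataStage tool: the helper names and the 1024 threshold are assumptions taken from the guess above, and the script has to be run as the same user ID that runs the job for the numbers to be relevant.

```python
import resource

# Threshold from the post above: 1024 is guessed to be the default and too low.
SUSPECT_NOFILE = 1024


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def looks_too_low(soft, threshold=SUSPECT_NOFILE):
    """True when the soft limit is finite and at or below the threshold."""
    return soft != resource.RLIM_INFINITY and soft <= threshold


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={describe(soft)} hard={describe(hard)}")
    if looks_too_low(soft):
        print("soft limit is at or below 1024; consider raising it for the job user")
```

The same numbers are what `ulimit -n` (soft) and `ulimit -Hn` (hard) report in a shell started by that user.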

Post by wuruima »

Teej, thanks for your long response.
The job failed with a message I could not understand. Even though I found the explanation on the IBM website, I could not make sense of it, which is why I posted here. I suspect this is a server resource issue, but who knows. After the rerun the job resumed; I just want to know what the error means.