Problems with DS Calls

Viswanath · Post by **Viswanath** » Thu May 27, 2004 8:20 am

Hi All,

We are facing a couple of problems while trying to call some DataStage jobs or sequencers from Control M using command files.

1.) When a job is called from Control M using a command file, every job is first reset and then run. Sometimes the reset doesn't finish and keeps running. A wait of 12 seconds is given between the reset and the run. However, because of the problem stated above, my job eventually fails with DSBadState = 2. Any idea why this happens? At this point I have to stop the reset and rerun the job. The next time, the job runs OK.

2.) We faced this as a one-off problem, but I am still trying to figure out the root cause. There has been an occurrence wherein a sequencer trying to call some jobs suddenly crashed with the following error.

Controller problem: Error calling DSRunJob(JobA), code=-14
[Timed out while waiting for an event]

Any idea why this has happened?

Any help would be great.

Cheers,

Amos.Rosmarin · Post by **Amos.Rosmarin** » Thu May 27, 2004 8:39 am

Hi,

Are you using dsjob to execute the jobs?
I suggest you write yourself a little script that does the error handling for you. First use -mode RESET and then run.
Do not execute a job directly; wrap it in a sequencer and use 'reset if required then run'.
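
A minimal sketch of the kind of wrapper script suggested here, assuming your dsjob supports the -wait and -jobstatus options. The function name `reset_then_run` and the `DSJOB` override variable are hypothetical, and the exit-code handling (0 from a completed reset, 1 or 2 from a run that finished OK or with warnings) is an assumption you should verify against your own release. The point is to wait for the reset to actually finish rather than sleep for a fixed number of seconds.

```shell
#!/bin/sh
# Hypothetical wrapper: reset a job, wait for the reset to finish, then run it.
# DSJOB may be overridden to point at the dsjob binary (assumption).
DSJOB="${DSJOB:-dsjob}"

reset_then_run() {
    project="$1"
    job="$2"

    # -wait blocks until the reset completes, instead of a fixed sleep.
    # Assumed: dsjob exits 0 when the reset request completed.
    "$DSJOB" -run -mode RESET -wait "$project" "$job" || {
        echo "reset of $project/$job did not complete cleanly" >&2
        return 1
    }

    # -jobstatus waits for the run and reports the job status as the exit code.
    status=0
    "$DSJOB" -run -mode NORMAL -jobstatus "$project" "$job" || status=$?

    # Assumed mapping: 1 = finished OK, 2 = finished with warnings.
    case "$status" in
        1|2) return 0 ;;
        *)   echo "run of $project/$job ended with status $status" >&2
             return 1 ;;
    esac
}
```

A Control M command file could then call something like `reset_then_run MyProject JobA` (both names are placeholders) and act on the wrapper's exit code.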

HTH,
Amos

Viswanath · Post by **Viswanath** » Thu May 27, 2004 8:55 am

Hi,

I am using dsjob to run these jobs, and in all cases I do have a sequencer with the "Reset if required then run" option, except in one case where a job is called instead of a sequencer.

kduke · Post by **kduke** » Thu May 27, 2004 11:04 am

It is my experience that resetting a job can take 30 seconds if the system is overloaded. A sequence should handle it for you. I expect that you need to have an Otherwise link in your sequence. Not all situations are trapped in a sequence: if all you have is an OK (successful) link and an error link, then the sequence ends if you get a warning. You need either an error link and an Otherwise link, or an OK link with errors handled by the Otherwise link.

kiran_kom · Post by **kiran_kom** » Fri May 28, 2004 1:09 pm

I've had the same problem this morning. It might be because you are using the "reset if required" option. I had inadvertently used it in my sequence and ran into the same issue. Try taking it out. It seems to have solved mine, but I'm not 100% sure; my jobs are still running.

Also, are you by any chance making heavy use of hash files in any of those jobs? There is a bug in DS Windows that causes it to crash if you are using lots of hash file stages at the same time.

ray.wurlod · Post by **ray.wurlod** » Fri May 28, 2004 5:20 pm

kiran_kom wrote:There is a bug in DS windows that causes it to crash if you are using lots of hash file stages at the same time.

Can you please elaborate, ideally providing a reference to the support case number? Or is it just that you didn't set the T30FILE tunable large enough?

kiran_kom · Post by **kiran_kom** » Fri May 28, 2004 6:09 pm

ray.wurlod wrote:
kiran_kom wrote:There is a bug in DS windows that causes it to crash if you are using lots of hash file stages at the same time.
Can you please elaborate, ideally providing a reference to the support case number? Or is it just that you didn't set the T30FILE tunable large enough?

Umm no... This is a known issue with DS Windows (well known only to Ascential folks, I guess). We have jobs that make heavy use of hash files, and there are multiple instances of the same job running.

This sometimes causes DS to crash. The error manifests itself as a "User limit reached" message in the &PH& directory. Ascential is working on a fix for it.

Yesterday when I was having the above mentioned problem, I also ran into this problem with hash files. I was
I don't think the above problem is related to this hash file issue, because just now (5 mins back) my jobs failed with the same controller problem (and no, they didn't have "reset if required" turned on). I didn't find any of the "User limit reached" messages in &PH&, so I guess this is a separate issue.

kiran_kom · Post by **kiran_kom** » Fri May 28, 2004 6:14 pm

ray.wurlod wrote:
kiran_kom wrote:There is a bug in DS windows that causes it to crash if you are using lots of hash file stages at the same time.
Can you please elaborate, ideally providing a reference to the support case number? Or is it just that you didn't set the T30FILE tunable large enough?

The support case number is 385112*WES

rdy · Post by **rdy** » Wed Jun 30, 2004 8:07 am

Viswanath wrote:Hi All,

2.) We faced this as a one-off problem, but I am still trying to figure out the root cause. There has been an occurrence wherein a sequencer trying to call some jobs suddenly crashed with the following error.

Controller problem: Error calling DSRunJob(JobA), code=-14
[Timed out while waiting for an event]

Did you ever resolve #2? We have the same problem occasionally and Ascential is pointing us to the shared memory parameters on our Solaris box. If you look on page 3-4 of the installation guide, they list the minimum recommended values.

You can see those values on a Solaris system by running /etc/sysdef and grep out the parm you're looking for. E.g. /etc/sysdef | grep SHMMNI.
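
As a rough sketch of that check, the helper below pulls one value out of sysdef-style output and compares it with a minimum you supply. The function names are hypothetical, and the parsing assumes each tunable line has the value in the first column and the parameter name in parentheses at the end. The minimum itself has to come from your installation guide; none is hard-coded here.

```shell
#!/bin/sh
# Print the value reported for one kernel tunable in sysdef-style output.
# Assumes each tunable line looks like "<value> <description> (<NAME>)".
sysdef_value() {
    grep "($1)" | head -1 | awk '{ print $1 }'
}

# Succeed only if the reported value is at least the given minimum.
# The minimum is taken from the caller, not from any built-in table.
sysdef_at_least() {
    actual=`sysdef_value "$1"`
    [ -n "$actual" ] && [ "$actual" -ge "$2" ]
}
```

On a Solaris box this might be used as `/etc/sysdef | sysdef_value SHMMNI`, or as `/etc/sysdef | sysdef_at_least SHMMNI <minimum>` with the figure from the guide filled in.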

I was told that SHMMNI was probably the culprit on our system. I'll let you know if it helps after we've made the change.

ogmios · Post by **ogmios** » Wed Jun 30, 2004 9:39 am

A little bit off topic, but my hat off to the guesses of Ascential support. And that's meant ironically.

They'd better build some more tracing into DataStage, as is the case in Oracle, for example. When something goes wrong in Oracle, you file an iTAR and 99% of the time you get a real fix very soon.

Ogmios

smohamme · Post by **smohamme** » Mon Sep 13, 2004 1:34 pm

smohamme wrote:
rdy wrote:
Viswanath wrote:Hi All,

2.) We faced this as a one-off problem, but I am still trying to figure out the root cause. There has been an occurrence wherein a sequencer trying to call some jobs suddenly crashed with the following error.

Controller problem: Error calling DSRunJob(JobA), code=-14
[Timed out while waiting for an event]

Did you ever resolve #2? We have the same problem occasionally and Ascential is pointing us to the shared memory parameters on our Solaris box. If you look on page 3-4 of the installation guide, they list the minimum recommended values.

You can see those values on a Solaris system by running /etc/sysdef and grep out the parm you're looking for. E.g. /etc/sysdef | grep SHMMNI.

I was told that SHMMNI was probably the culprit on our system. I'll let you know if it helps after we've made the change.

Hello

I was wondering whether you fixed issue #2. We have been getting this for a week and Ascential has not been able to solve it. We are using DataStage 6.x running on Solaris. I will try your suggestion too and see what happens. Also, what should SHMMNI be set to?

Thank you!

ogmios · Post by **ogmios** » Mon Sep 13, 2004 1:47 pm

At one site where they ran DataStage on Solaris, we fixed this problem by changing the order of some of the shared libraries in the dsenv file on the recommendation of Ascential... but only after we sent them our truss file of the job in action.

I don't remember anymore which shared libraries or what order.

Ogmios

smohamme · Post by **smohamme** » Mon Sep 13, 2004 4:20 pm

ogmios wrote: At one site where they ran DataStage on Solaris, we fixed this problem by changing the order of some of the shared libraries in the dsenv file on the recommendation of Ascential... but only after we sent them our truss file of the job in action.

I don't remember anymore which shared libraries or what order.

Ogmios

Thank you! If you can, please provide more details like the shared libraries and their order.

smohamme · Post by **smohamme** » Thu Sep 23, 2004 2:34 pm

smohamme wrote:
ogmios wrote: At one site where they ran DataStage on Solaris, we fixed this problem by changing the order of some of the shared libraries in the dsenv file on the recommendation of Ascential... but only after we sent them our truss file of the job in action.

I don't remember anymore which shared libraries or what order.

Ogmios

smohamme wrote: Thank you! If you can, please provide more details like the shared libraries and their order.

We have added the following library path in the dsenv file:

LD_LIBRARY_PATH=/usr/lib/lwp:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

and we have replaced the DSRunJob.B file, since it was corrupt. This did not fix our "Timed out..." issue. Ascential also informed us that the Production ETL box is overutilized. We had been running 20 instances of our jobs at once; we reduced this to 12 instances and, once those completed, started another run with the remaining 8. This has worked, although it does not explain why on our Dev box (which is smaller in processing power/memory) this works fine with the 20 instances. The differences between the two in the uvconfig file are:

| Parameter | Production | Development |
|---|---|---|
| MFILES | 200 | 50 |
| T30FILE | 2000 | 500 |
| UVSYNC | 0 | 1 |
| 64BIT_FILES | 1 | 0 |
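
A comparison like the one above can be produced mechanically rather than by eye. The sketch below is only an illustration: the function names are hypothetical, and it assumes the uvconfig file holds one `NAME VALUE` setting per line with `#` starting a comment. It reports only settings present in both files with different values; a setting that exists in just one file is not listed.

```shell
#!/bin/sh
# List "NAME VALUE" pairs from a uvconfig-style file, sorted by name.
# Assumes '#' starts a comment and each setting is "NAME VALUE" on one line.
uvconfig_params() {
    sed 's/#.*//' "$1" | awk 'NF >= 2 { print $1, $2 }' | sort -k1,1
}

# Print "NAME first-value second-value" for settings whose values differ.
uvconfig_diff() {
    a="${TMPDIR:-/tmp}/uvconfig_a.$$"
    b="${TMPDIR:-/tmp}/uvconfig_b.$$"
    uvconfig_params "$1" > "$a"
    uvconfig_params "$2" > "$b"
    # join pairs up lines with the same name; awk keeps the mismatches.
    join "$a" "$b" | awk '$2 != $3 { print $1, $2, $3 }'
    rm -f "$a" "$b"
}
```

With copies of the two files at hand, `uvconfig_diff prod.uvconfig dev.uvconfig` (placeholder file names) would print one line per differing parameter.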

DSXchange
