19 jobs failed with ds_ipcgetnext

PaulS
Premium Member
Posts: 45
Joined: Fri Nov 05, 2010 4:38 am

19 jobs failed with ds_ipcgetnext

Post by PaulS »

Hi, sorry about the title - I didn't want this dismissed with the usual "timeout waiting for mutex" answer! I know about row buffers, etc...

Anyway, here's the problem. On Friday, one job, a pull from a SQL Server DB, failed with "timeout waiting for mutex". I reset and reran it, and it failed pretty much straight away, with other jobs running. I reran it on Saturday morning, again with other jobs running, and it completed successfully.

Tonight the same job, along with 18 others (in the same category), failed with mutex errors. Jobs in other categories are running to completion without issue. Only the jobs in this particular category failed, all from the same source, and all at pretty much the same time.

As well as the ds_ipcput() error, I'm also getting a ds_ipcgetnext() error thereafter. Most of the server jobs have this error, and most have it in a CInterProcess stage.

Nothing has changed: no upgrades to DS or its jobs. I can't speak for the source system, however.

Any help very much appreciated!!!

Thanks in advance
Paul
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: 19 jobs failed with ds_ipcgetnext

Post by SURA »

Just a couple of general questions:

1) Have these jobs been running for a while without any issues?

2) Even though nothing changed on the DS side, were any changes made to the network / SQL Server / OS?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

This is an example of a case where you may benefit from slightly increasing the buffer timeout value.

It's still related to total load on the machine, but if you can allow the IPC buffers a bit more grace time, you *should* get fewer timeouts.
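
As a rough illustration of why load and the timeout interact, here is a minimal Python sketch of a writer and a reader sharing a small bounded buffer. It is only an analogy, not how DataStage implements IPC buffers: `run_pipeline`, its parameters and the sleep that stands in for a CPU-starved reader are all hypothetical.

```python
import queue
import threading
import time


def run_pipeline(rows, buffer_slots, timeout_s, consumer_delay_s):
    """Return "ok" if every row passes through the buffer, else "timeout"."""
    buf = queue.Queue(maxsize=buffer_slots)  # small shared buffer between stages
    done = object()                          # sentinel marking end of data

    def consumer():
        while True:
            item = buf.get()
            if item is done:
                return
            # Hypothetical stand-in for a reader that is slow because the box is busy.
            time.sleep(consumer_delay_s)

    reader = threading.Thread(target=consumer, daemon=True)
    reader.start()
    try:
        for row in range(rows):
            # The writer gives up if the buffer stays full longer than timeout_s.
            buf.put(row, timeout=timeout_s)
        buf.put(done, timeout=timeout_s)
    except queue.Full:
        return "timeout"
    reader.join()
    return "ok"


if __name__ == "__main__":
    print(run_pipeline(5, 1, 0.05, 0.5))  # slow reader, short grace time
    print(run_pipeline(3, 1, 2.0, 0.1))   # same kind of reader, more grace time
```

In this toy model a longer timeout lets the writer ride out a slow reader, but it does nothing about whatever made the reader slow in the first place.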
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

What's the timeout setting on the IPC stages? Are your jobs hitting this timeout value?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

One of the reasons I very rarely used them darn IPC stages. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulS
Premium Member
Posts: 45
Joined: Fri Nov 05, 2010 4:38 am

Re: 19 jobs failed with ds_ipcgetnext

Post by PaulS »

SURA wrote: Just a couple of general questions:

1) Have these jobs been running for a while without any issues?

2) Even though nothing changed on the DS side, were any changes made to the network / SQL Server / OS?
Yes, these jobs have been running in 8.5 since I upgraded in April. No previous issues until Friday's single job failure, and now all these 19.

We've had no network or O/S changes - not sure about the DB. I wasn't informed of any changes.

All the IPC stages are at defaults:
Buffer: 128 KB
Timeout: 10 secs
Yes, it looks as though they are hitting the 10 secs and erroring.

I can up all of them, but I don't understand why they're failing now. The category has 195 jobs; 19 failed with this error.
PaulS
Premium Member
Posts: 45
Joined: Fri Nov 05, 2010 4:38 am

Post by PaulS »

One quick question - would this error ever be thrown if the connection to the source database dropped?
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Re: 19 jobs failed with ds_ipcgetnext

Post by Kryt0n »

You mentioned all 19 were hitting the same DB; are these the only ones to hit that DB? If so, your culprit is almost certainly a change on the DB front. How long do the queries take to run in a DB client? What database is it?

Is there any load on either the DataStage server or the DB server when trying to run the jobs? How many of these are you running at one time?
PaulS
Premium Member
Posts: 45
Joined: Fri Nov 05, 2010 4:38 am

Post by PaulS »

I have probably 7 or so jobs running simultaneously into the same database. The only ones which are causing a problem are the IPC jobs.

Sorry, I'm starting to understand this a little more.

I got our Unix administrator to report the process activity over the period. There was a massive CPU spike at the time the jobs started to go wrong. The data didn't show load (how many processes were waiting), but given the utilisation of the 4 CPUs, I suspect this is where the problem is.

I am going to up the project's timeout parameter to 20 seconds... I have some questions which look not to have been answered here...
:(
jdsmith575210
Participant
Posts: 14
Joined: Mon Jan 19, 2009 9:06 pm

Post by jdsmith575210 »

We saw problems with IPC stages whenever the column metadata (datatype, length, display) defined in the stage didn't match what was coming from the source. Correcting the metadata helped but never resolved all of our problems. In the end, we removed the IPC stages whenever a job would fail with this error.

I don't see any mention of what you upgraded from or what database you're using, but we experienced lots of strange errors when upgrading from 7.5.2 to 8.1. You may want to look at what patches you had installed on your previous version and see if something similar needs to be applied to 8.5. We struggled for a year before discovering a patch that needed to be applied to our new environment.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

PaulS wrote: I am going to up the project's timeout parameter to 20 seconds... I have some questions which look not to have been answered here...
:(
Nope; to me, you are pushing the issue back, not solving it. Based on my understanding, if the load is delayed due to network traffic or any other reason, you will face this issue again.

I am not sure about your job design.

1) Write the data into a file and then use a separate load job; that could resolve it (95%).
2) Replace the IPC with a file.
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
PaulS
Premium Member
Posts: 45
Joined: Fri Nov 05, 2010 4:38 am

Post by PaulS »

We also upgraded from 7.5.2. I hit mutex errors in 8.5 in a job using a link partitioner/collector. This is the first time I've seen it in an IPC.

From the mass of documents I've read, it appears IPC stages are more trouble than they are worth. Unfortunately my category has 195 jobs, each with two IPCs. I'm not about to rewrite them all.

I've been looking at the sequencer and, instead of upping the timeouts, I'm going to resequence the calling of the jobs. I have 7 strands running simultaneously; the whole sequence takes 25 mins, and a couple of the strands complete in 10 mins. I'll combine them and take some of the weight off the early period of heavy utilisation. There is some scope to smooth it out further if needed.
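
To make the resequencing idea concrete, here is a small Python sketch that counts how many strands overlap at the busiest moment. Only the 7 strands, the 25 minute total and the two 10 minute strands come from the description above; the other durations and the `peak_concurrency` helper are made up for illustration.

```python
def peak_concurrency(strands):
    """strands: list of (start_minute, duration_minutes). Return the max running at once."""
    events = []
    for start, duration in strands:
        events.append((start, 1))               # strand starts
        events.append((start + duration, -1))   # strand finishes
    # At the same minute, process finishes (-1) before starts (+1),
    # so a strand chained directly after another does not count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical durations: one 25 min strand, two 10 min strands, the rest invented.
all_at_once = [(0, 25), (0, 20), (0, 18), (0, 15), (0, 12), (0, 10), (0, 10)]
# Same strands, but the second 10 min strand is chained after the first.
resequenced = [(0, 25), (0, 20), (0, 18), (0, 15), (0, 12), (0, 10), (10, 10)]

if __name__ == "__main__":
    print(peak_concurrency(all_at_once))   # 7 strands compete at the start
    print(peak_concurrency(resequenced))   # 6 at most, still done by minute 25
```

With these invented numbers, chaining the two short strands removes one competitor from the busy opening minutes without lengthening the overall 25 minute run.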

Thanks for everyone's help here - very much appreciated!

Paul