Semaphores are building up to a very high number

ulab
Participant
Posts: 56
Joined: Mon Mar 16, 2009 4:58 am
Location: bangalore

Post by ulab »

Hello DS Friends,

We have an issue where jobs are running long on the XYZ server. On checking, we found that the jobs are not going to the Grid queue, and we saw this error: Port Library failed to initialize, Could not create the Java virtual machine.
After googling the error, we learned that the problem occurs when the semaphore count (ipcs -rsa | wc -l) reaches 1000000. My question to all my DS friends is:

What causes the semaphores to keep building up? We are badly impacted by this issue; it happens every week, irrespective of the environment. The only workaround we have now is to fail over to the secondary node/server and reboot the primary server. Please share your thoughts and experiences on the root cause of the semaphore build-up.

NOTE: We submitted a PMR to IBM, but there is no resolution or other workaround from them yet.
Ulab----------------------------------------------------
help, it helps you today or Tomorrow
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

An internet search will reveal to you that a semaphore is a place for a program to wait for some event to occur (even for an amount of time to elapse). Semaphores are implemented in different ways on different platforms, but in all cases they should be released once finished with. If they aren't, then there's some problem with the application - you really will need to wait for your official support to help to diagnose, since there are many different processes making up most DataStage applications.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

What version of DataStage are you running?

The ODBC manager on a particular release had some semaphore issues which caused them to never get released.

I ran into that at my previous client outside of Boston. I believe 9.1 FP2 resolved the issue, but I left before we applied it to PROD (the only environment that had the issue). We had to run a manual cleanup until FP2 was installed.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Are you on AIX? My current client has this issue (they are running 8.7 on AIX 6.1) and we are working with IBM to get a fix. Apparently the issue has been identified, but there is no resolution at this time.

We investigated using ipcrm to remove the semaphores, but there were considerable drawbacks / risks involved. Frequent system re-boots were deemed the safest workaround.

I wrote a script that checks the ipcs count every 30 minutes and writes it out to disk so we can see how the count is creeping up. On our very busy system the ipcs command will occasionally time out, causing it to return an error instead of a semaphore list, but other than that it works well.

Code:

# Log a timestamp and the ipcs line count every 30 minutes.
while true
  do
    echo "$(date) $(ipcs -rs | wc -l)" >> /home/asorrell/ipcs.out
    sleep 1800
  done
I just submit this in the background with the nohup option and let it run till we reboot.
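
For anyone adapting the loop above, here is a rough sketch of one way to keep a failed or timed-out ipcs call from being logged as if it were a real count. It assumes ipcs exits non-zero when it fails, which may not hold on every system. IPCS_CMD, OUT_FILE and log_ipcs_sample are hypothetical names introduced for illustration, not anything from the original script.

```shell
#!/bin/sh
# Sketch only: log one sample, marking failed ipcs runs instead of counting them.
# IPCS_CMD and OUT_FILE are hypothetical settings; adjust to your environment.
IPCS_CMD="${IPCS_CMD:-ipcs -rs}"
OUT_FILE="${OUT_FILE:-/tmp/ipcs.out}"

log_ipcs_sample() {
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    # Capture the output first so the exit status of ipcs can be checked.
    # Assumption: a failed or timed-out ipcs returns a non-zero status.
    if out=$($IPCS_CMD 2>/dev/null); then
        # Line count includes any header lines, like the original wc -l.
        count=$(printf '%s\n' "$out" | wc -l | tr -d ' ')
        echo "$stamp $count" >> "$OUT_FILE"
    else
        echo "$stamp ERROR ipcs failed" >> "$OUT_FILE"
        return 1
    fi
}
```

It could be driven the same way as the original, for example with `while true; do log_ipcs_sample; sleep 1800; done` submitted under nohup.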
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

ipcrm -ruo <userid #1>
ipcrm -ruo <userid #2>
etc...

That command will clear out the unreleased semaphores. It should leave the ones currently in use unaffected.

My old client was on AIX as well. IBM had indicated that the issue was associated with a bad ODBC driver manager. "Should be fixed" in FP2 for 9.1. But I never got a chance to validate that.

I can ask my old team mates to see if they applied the patch yet.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

We thought about using the same ipcrm command but the majority of our jobs are run by dsadm. IBM Engineering said that they could not guarantee that using ipcrm against dsadm on a running system would not have any side-effects, so we decided to skip that and reboot. I still think it would be safe, but if IBM said no, I wasn't willing to put my neck on the line to try it.

Now if you are on a system where most of the jobs are run under individual user IDs, using the ipcrm command against user IDs that are not currently logged in (but have left semaphores behind) should be 100% safe.
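
A minimal dry-run sketch of that idea, for illustration only: it takes a list of candidate user names, skips anyone who shows up in who, and only prints the ipcrm commands rather than running them. WHO_CMD and print_cleanup_cmds are hypothetical names, and passing a user name to ipcrm -ruo is an assumption based on the form quoted earlier in this thread. Not being logged in does not prove the user has no jobs running, so review the output before running anything.

```shell
#!/bin/sh
# Dry run: print ipcrm commands only for candidate users with no login session.
# WHO_CMD is a hypothetical override (handy for testing); default is who(1).
WHO_CMD="${WHO_CMD:-who}"

print_cleanup_cmds() {
    # Arguments: candidate user names, supplied by the administrator.
    logged_in=$($WHO_CMD | awk '{print $1}' | sort -u)
    for user in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        if printf '%s\n' "$logged_in" | grep -qxF -- "$user"; then
            echo "# skip $user: currently logged in"
        else
            # Command form as quoted in this thread (AIX); printed, not run.
            echo "ipcrm -ruo $user"
        fi
    done
}
```

Example: `print_cleanup_cmds user1 user2` prints one line per name. Leaving dsadm out of the list is consistent with the caution above.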
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020