DS Server Restart

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi,
We want to restart the DS server.
We know that everybody has logged out, but
when we execute the netstat -na | grep 31538 command,
we find that one user is still connected to DS:
tcp XXXXX 100.200.20.30 100.200.10.15.1740 .....ESTABLISHED.
I am sure that nobody is using 100.200.10.15.
How is this possible?
Is there any command to get more details about that connection?
Unless I stop/kill that process, I can't proceed any further :( .


regards
kcs
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You have the IP address, 100.200.10.15.
Check with that system whether any application is still running.
If not, run $DSHOME/bin/list_readu to list the locks.
It might be a daemon running in the background.
Try UNLOCK ALL.
Use ps -eaf | grep dscs to get the user ID that holds the lock (you probably won't see any :wink: ).
Then kill the process.
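A minimal sketch of the ps step above, assuming the usual SysV ps -ef column layout (UID, PID, PPID first, after a header line). The helper name list_ds_clients is hypothetical; it skips the header and the search itself, and prints the owner, PID and parent PID of any dscs or dsapi_slave process (the process names mentioned in this thread), so you can see who holds the connection before killing anything.

```shell
#!/bin/sh
# Sketch: show who owns any remaining DataStage client processes
# (dscs / dsapi_slave, the names used in this thread) and their parents.
# Assumption: `ps -ef` output starts with a header line followed by the
# columns UID PID PPID ... (SysV layout on HP-UX, Solaris, AIX and Linux).

# list_ds_clients (hypothetical helper): reads `ps -ef` output on stdin and
# prints "user pid ppid" for each dscs/dsapi_slave process.
list_ds_clients() {
    # NR > 1 skips the header; !/awk/ drops this awk process itself,
    # whose command line also contains the search pattern.
    awk 'NR > 1 && /dscs|dsapi_slave/ && !/awk/ { print $1, $2, $3 }'
}

# Typical use:
#   ps -ef | list_ds_clients
```

A parent PID of 1 usually means the process has already been orphaned and adopted by init.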
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's not always possible to clear all users, and you may not personally have the permissions you need to do so. In that case, stop the server, which will kill the connection... eventually. In most cases. :wink:

I've had to do that many times for various and sundry reasons. Once the server is down, start checking to see what state any remaining busy sockets are in:

netstat -a |grep dsrpc
Ideally, this will produce an empty list. If it doesn't, you'll see sockets in states like FIN_WAIT1, FIN_WAIT2, LAST_ACK (etc.), with the first two being the most typical. Admin or root privileges can help speed up the process, but on my HP-UX box the sockets usually free themselves, draining out in 5 to 15 minutes. Only once have I needed help from an SA to free one last socket that seemed like it was going to hang forever, stuck in a status I hadn't seen before and don't recall. :?
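To see at a glance which states the leftover sockets are in, the netstat output can be tallied per state. A minimal sketch, assuming netstat prints the TCP state as the last field of each line (the usual layout on HP-UX, Solaris and Linux, though the exact state spellings differ, e.g. FIN_WAIT2 vs FIN_WAIT_2); count_socket_states is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: summarise the TCP states of the remaining dsrpc sockets, e.g.
#   2 FIN_WAIT_2
#   1 LAST_ACK
# Assumption: the state is the last whitespace-separated field of each line.

# count_socket_states (hypothetical helper): reads netstat lines on stdin
# and prints "count state" pairs, sorted by state name.
count_socket_states() {
    awk 'NF > 0 { n[$NF]++ } END { for (s in n) print n[s], s }' | sort -k2
}

# Typical use after stopping the engine:
#   netstat -a | grep dsrpc | count_socket_states
```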

Once I've seen the sockets, I usually switch to a short version that just shows the count involved, waiting for it to go to zero:

netstat -a |grep dsrpc|wc -l
Don't even think about restarting if this doesn't return zero: the engine will come up, but the dsrpcd listener will 'bind bomb' and not be able to start, meaning you will not be able to connect from any of your client installations. As a final check of a successful engine start, you can grep the process list for dsrpcd. If it is not running, just stop the engine again and wait longer. If people get impatient, see if an SA or other Root Wielder can help.
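The wait-for-zero check above can be scripted as a simple polling loop. A sketch only: wait_for_dsrpc_drain is a hypothetical name, it reuses the netstat | grep dsrpc | wc -l pipeline from this post, and the interval and retry count are illustrative (the 5 to 15 minutes mentioned above suggests something like 60 seconds x 15 tries). It deliberately does not start the engine itself.

```shell
#!/bin/sh
# Sketch: poll until no dsrpc sockets remain, giving up after a limit.
# Only start the engine once this succeeds; afterwards confirm the dsrpcd
# listener is up, e.g. with: ps -ef | grep dsrpcd | grep -v grep
# Assumption: the dsrpc service name resolves (e.g. via /etc/services), so
# netstat shows "dsrpc" rather than the bare port number.

# wait_for_dsrpc_drain INTERVAL_SECS MAX_TRIES (hypothetical helper)
# Returns 0 as soon as the count is zero, 1 if it is still non-zero
# after MAX_TRIES checks.
wait_for_dsrpc_drain() {
    interval=${1:-60}
    tries=${2:-15}
    while [ "$tries" -gt 0 ]; do
        # tr strips the leading blanks some wc implementations print
        count=$(netstat -a | grep dsrpc | wc -l | tr -d ' ')
        if [ "$count" -eq 0 ]; then
            echo "no dsrpc sockets left; safe to start the engine"
            return 0
        fi
        echo "$count dsrpc socket(s) still open; waiting ${interval}s"
        sleep "$interval"
        tries=$((tries - 1))
    done
    echo "sockets still open; stop again and wait longer, or ask an SA" >&2
    return 1
}

# Example: check every minute for up to 15 minutes.
#   wait_for_dsrpc_drain 60 15
```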
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi Kumar, Craig,
Thanks for your replies.
We executed ps -fe | grep dsapi_slave | more to find the process ID of this connection
and got user2 7623 Sep12 .......ESTABLISHED.

We killed that process (7623) as root.
When we then executed netstat -na | grep 31538, we got
user2 7623 Sep12 .......FIN_WAIT_1.

Does this mean that the process has been running since Sep 12?
Can we stop the DS server while the process is showing FIN_WAIT_1?


regards
kcs
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Yes, it has been running since Sep 12.
And... FIN_WAIT_1 ---> waiting to die :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The process should disappear within (at worst) 90 minutes, depending on your UNIX settings, though I'm not sure offhand which network configuration parameter controls that socket timeout. I don't think there is much you can do between FIN_WAIT and CLOSED status, although each hardware vendor / UNIX flavor has some means of doing this. If you do a Google search on your OS and FIN_WAIT, you might find some help on how to get rid of the sockets before their natural timeouts.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

We kill that process(7623) through root.
Aarrggh! Did you check whether pid 7623 had any child processes before you killed it? Killing a parent can turn a nohup child into a zombie, and they're almost impossible to get rid of without rebooting UNIX.

And killing a process does not release any associated socket, at least not immediately, particularly if the socket is in a wait state. If you're on Solaris, you can use the ndd command to find out what the TCP timeout is. There are similar commands in other UNIX variants, but none come to mind at the moment.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi,
Sorry for the late response.
After we got user2 7623 Sep12 .......FIN_WAIT_1,
we restarted the server.
When we checked with netstat -na | grep 31538, it returned nothing.
DS is working fine now.
Did you check to see whether pid 7623 had any child processes before you killed it? Killing a parent can turn a nohup child into a zombie, and they're almost impossible to get rid of without rebooting UNIX.
We didn't check whether pid 7623 had any child processes.
I think we can find the child processes using an awk script.
This has to be executed as root.
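For the child-process check, an awk filter on the PPID column of ps -ef is enough. A minimal sketch, assuming the standard ps -ef layout (UID, PID, PPID ...); children_of is a hypothetical helper name. It only finds direct children, so repeat it for each PID it prints if you need grandchildren too. Listing processes this way normally doesn't need root; killing another user's processes does.

```shell
#!/bin/sh
# Sketch: list the direct children of a process before killing it,
# using the PPID column (field 3) of `ps -ef`.
# Assumption: standard `ps -ef` column layout (UID PID PPID ...) with a
# header line first.

# children_of PID (hypothetical helper): reads `ps -ef` output on stdin
# and prints "pid user" for every process whose parent is PID.
children_of() {
    awk -v parent="$1" 'NR > 1 && $3 == parent { print $2, $1 }'
}

# Typical use before killing 7623:
#   ps -ef | children_of 7623
# Any PIDs printed are children; check what they are before going further.
```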

Can you tell me:
if the child process is still there,
what kind of problem will it create?
Is rebooting UNIX the only solution?


regards
kcs
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Yes, but not always.
If the child process is left orphaned, the socket may still be open (even though the actual process is no longer running on the server), which leaves a zombie process behind. So if you start the server before the connection dies, the RPC daemon may not start, and you may get error code 81016. That will most likely mean rebooting the UNIX machine.