Zombies UVSH

Posted: Fri Nov 28, 2003 5:53 pm
by ariear
Hi ALL,
It appears that in some situations DataStage server processes become zombies. This problem is quite difficult to reproduce (even with Ascential support), but maybe there's an answer somewhere ???!!!
It can happen on any platform but mostly on W2K.
If a job that has a large write cache enabled is stopped (via Director) in the middle of flushing its buffers (it can be seen via Monitor that the pumping of records has stopped and the rate is decreasing), the job status is set to aborted instead of stopped - probably the UVSH becomes a zombie ! :evil:
Or if a lookup with a large cache enabled is stopped (via Director) in the middle of building its RAM structure - :evil:
An ODBC stage that runs a heavy query and is stopped before a result set has been received :evil:

Another variation is that the job gets a stopped status but its UVSH is still running (sometimes you can get log messages after the stopped entry) and eventually terminates, but only after a long delay. One can then think that the job has really stopped and, for some good reason, issue a re-compilation (even if there's a UVSH still on the air), and you can get very complicated situations like synchronization errors and sometimes even successful jobs that don't really run, etc. :evil:

Any help on this one :?:

Posted: Fri Nov 28, 2003 7:47 pm
by kcbland
Anytime you stop jobs via Director, or jobs abort, you should always make sure that the failed jobs don't leave threads out there.

Code:

$ ps -ef |grep phantom
  radnet 11779 11776  0 09:02:02 ?        0:14 phantom DSD.StageRun loadupdIRCashIVDayAg. loadupdIRCashIVDayAg.xfm 3 0/0
  radnet  1761  1760  2 08:56:27 ?       23:18 phantom DSD.RUN Batch::MasterControlIROrderDetail. 0 ParameterFile=/var/opt/dat
  radnet  4992 18088  0 10:14:57 pts/14   0:00 grep phantom
  radnet 11776  1761  0 09:02:02 ?        0:00 phantom DSD.RUN loadupdIRCashIVDayAg. 0/0 SourceFileDirectory=/var/opt/datastag

Every job has a DSD.RUN thread you can consider as the "MAIN" thread, and all active stages within the job show up as DSD.StageRun. Any time jobs have an issue, you should check to make sure that there are no DSD.StageRun threads active. This is simple on Unix; in fact, you can write a shell script that checks the "ps" output to make sure every DSD.StageRun process has a corresponding DSD.RUN.

The DSD.StageRun threads are your mysterious zombies. You can safely kill these threads. You should also recompile or clear the status of the jobs after doing so. Sometimes a zombie will interfere with the next run of the job: the job will start up and finish immediately, with no work done, and a successful state.
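
A minimal sketch of the check described above, not a tested production script. It assumes a POSIX "ps" that accepts -o pid= -o ppid= -o args=, and that the processes carry "DSD.RUN" and "DSD.StageRun" in their command line as in the listing. The function name find_orphan_stageruns is made up for illustration.

```shell
#!/bin/sh
# find_orphan_stageruns (hypothetical name): reads "PID PPID ARGS..." lines on
# stdin and prints the PID of every DSD.StageRun process whose parent is not
# a DSD.RUN process present in the same listing.
find_orphan_stageruns() {
    awk '
        # Remember the PID of every DSD.RUN (job "MAIN") process.
        / DSD\.RUN /      { is_run[$1] = 1 }
        # Remember every DSD.StageRun (active stage) process and its parent PID.
        / DSD\.StageRun / { parent[$1] = $2 }
        END {
            # Done in END because a parent may be listed after its child.
            for (pid in parent)
                if (!(parent[pid] in is_run))
                    print pid
        }
    ' | sort -n
}
```

Typical use would be: ps -e -o pid= -o ppid= -o args= | find_orphan_stageruns. Matching on the parent PID rather than on the job name is a design choice: it still flags a leftover DSD.StageRun when a new run of the same job has already started.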

phantom

Posted: Fri Nov 28, 2003 8:58 pm
by PhantomSquawk
remember this, it is the phrase to get in to the special meetings. Say it to the doorman: "The phantom squawks at midnight"

Posted: Sun Nov 30, 2003 9:01 am
by ray.wurlod
The password is "swordfish".

The password is ALWAYS "swordfish".

Posted: Sun Nov 30, 2003 3:27 pm
by ariear
Thanks for the confirmation, Ken (and for the enigmatic passwords).
Any good practices except NOT STOPPING JOBS USING DIRECTOR :?: OR CAREFULLY CHECKING AFTER INEVITABLE STOPS :!:

Posted: Sun Nov 30, 2003 9:45 pm
by kcbland
You're on Windoze, so if you don't have a Unix command interpreter like MKS Toolkit, you won't have a process status command like "ps". If you have the NT Resource Kit, the closest equivalent will be pview.exe. This will allow you to see the individual processes in a job.

As for the swordfish reference, I had to do a google search. That reference is a little before my time. I always liked the Stooges better. Still don't know the phantom reference, though.

As far as best practices, there's nothing like standard recovery procedures. If your job dies because of one of the following, expect zombies to occasionally occur. For SERVER jobs, this has been my experience for 5+ years:

(1) Job database connection was dropped midstream
(2) Job database instance had peculiar error
(3) User STOPped job from Director in the middle of database query
(4) DBA killed job query in database
(5) DataStage project filesystem filled to capacity
(6) Job attempted a mathematical expression where one of the equation components contained a NULL value
(7) Job (DS 5+) modified an argument value in a passed user-defined FUNCTION call without copying the argument to a local variable
(8) Job (DS 5+) used the BASIC STATUS() function; this one was weird, it worked in some versions and not others

So, compile a list of the types of job crashes that produced zombies. Then, develop a wysiwyg script/whatever to clean the process table of DSD.StageRun threads without DSD.RUN parents. Or just whack them by hand. This is difficult if your ETL application is turned over to a 24x7 operations center. Better off with the script approach.
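
One way the cleanup half of the "script approach" could look, as a hedged sketch only. It assumes you already have a list of orphaned DSD.StageRun PIDs from whatever check you trust. The names reap_pids and DRY_RUN are hypothetical. It defaults to a dry run so that an operations center can review the list before anything is killed.

```shell
#!/bin/sh
# reap_pids (hypothetical name): reads one PID per line on stdin.
# With DRY_RUN=1 (the default) it only reports what it would do;
# with DRY_RUN=0 it sends SIGTERM to each PID.
reap_pids() {
    while read -r pid; do
        # Ignore anything that is not a plain positive integer.
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would kill $pid"
        else
            kill -TERM "$pid" 2>/dev/null && echo "killed $pid"
        fi
    done
}
```

After a real run you would still recompile or clear the status of the affected jobs, as noted earlier in the thread.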

Posted: Mon Dec 01, 2003 6:22 am
by ray.wurlod
kcbland wrote:
As for the swordfish reference, I had to do a google search. That reference is a little before my time. I always liked the Stooges better. Still don't know the phantom reference, though.
The Marx Brothers may have started before the Three Stooges (who were originally a vaudeville act - did you know that?), but they were still going - and therefore contemporaries, and dare I suggest, just a tad more cerebral in their humour? But that's starting off a low base!

Earlier this year a US project manager of a large project in India showed three Stooges films (off DVD) one lunchtime to the great bemusement of all present.

Posted: Mon Dec 01, 2003 6:24 am
by ray.wurlod
kcbland wrote:Still don't know the phantom reference, though.
DSD.RUN invokes DSD.StageRun with

Code:

PHANTOM SQUAWK DSD.StageRun { command_line_options }

SQUAWK is a synonym in the VOC for REPORTING. It used to cause gales of laughter as the preferred option for COPY for Prime INFORMATION.

Code:

COPY FROM file1 TO file2 ALL SQUAWK

With the PHANTOM verb, and NOTIFY ON in effect, it causes (forces?) the child process to notify the parent, which is where the [Done] message in the &PH& record for the job comes from. Now you know.

Posted: Mon Dec 01, 2003 2:36 pm
by ariear
O.K.
Is there any sense in writing a daemon that checks the UniVerse process table for orphan DSD.StageRun processes? Or a check after each job that terminates under Batch/Sequencer control? If I understand this architecture correctly, a batch job will appear only as DSD.RUN.
Is the dslictool clean_lic -a command also applicable to PHANTOM processes?

Thanks Guys !

Re: phantom

Posted: Mon Dec 01, 2003 7:46 pm
by kcbland
PhantomSquawk wrote:remember this, it is the phrase to get in to the special meetings. Say it to the doorman: "The phantom squawks at midnight"
Thanks for the info Ray, but this is the reference I don't get. Care to clue me in?

Posted: Tue Dec 02, 2003 10:04 am
by ray.wurlod
Nope, that one's got me, too. :oops:
PhantomSquawk, care to enlighten the world?