Long Startup Time and Hang Issue in DataStage

roy.abhishek
Participant
Posts: 2
Joined: Sun Jun 02, 2019 1:28 pm

Post by roy.abhishek

Hi All,

We are running DataStage 11.7 on Windows Server 2012 in a 3-tier architecture (1. Services and Engine, 2. DB2 Metadata Repository, 3. Clients), with each tier on Windows systems.

We recently moved into production. We have 3 major batches, each containing multiple child sequences and multi-instance parallel jobs. Within each batch, multiple jobs are designed to run in parallel.
1. Batch_Cyle_Hrs (expected to run every x hours during the day)
2. Batch_Nightly (expected to run daily)
3. Batch_Weekly (expected to run weekly)

Since moving our code into production, we have had jobs abort at random, several times within the same batch cycle and also across different cycles. Some of the errors we saw are as follows:
1. Could not run job job_name in reset mode (code=-14)
Error calling DSPrepareJob(job_name)
2. DB2_Connector_Stage_name: Internal Error: (item != NULL): processmgr/connections.C: 246
3. main_program: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for players to confirm startup. This likely indicates a network problem. Status from APT_PMpoll is 0; node name is node1
4. main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

We then changed the following parameter values in the uvconfig file:
- MFILES 200 -> 250
- GLTABSZ 75 -> 100
- RLTABSZ 300 -> 400
- MAXRLOCK 299 -> 399
- SYNCALOC 1 -> 0
We also applied the recommended Windows registry changes, but did not change the boot.ini configuration file (Ref: https://www.ibm.com/support/knowledgece ... e_win.html)
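
To confirm that the file really holds the values listed above, the settings can be read back with a short script. This is only a minimal sketch: it assumes uvconfig uses plain `NAME value` lines with `#` comments, and the function names and the sample text are illustrative, not part of DataStage.

```python
# Values we expect after the edit (taken from the list above).
EXPECTED = {"MFILES": 250, "GLTABSZ": 100, "RLTABSZ": 400,
            "MAXRLOCK": 399, "SYNCALOC": 0}

def read_tunables(text):
    """Parse 'NAME value' lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Keep only entries whose second token is an integer.
        if len(parts) >= 2 and parts[1].lstrip("-").isdigit():
            values[parts[0]] = int(parts[1])
    return values

def mismatches(text, expected=EXPECTED):
    """Return {name: (found, expected)} for every tunable that differs."""
    found = read_tunables(text)
    return {name: (found.get(name), want)
            for name, want in expected.items()
            if found.get(name) != want}

# Illustrative sample; in practice pass the contents of the real uvconfig file.
sample = """# sample uvconfig fragment (hypothetical)
MFILES 250
GLTABSZ 100
RLTABSZ 400
MAXRLOCK 399
SYNCALOC 0
"""
print(mismatches(sample))
```

An empty result means every listed parameter matches; any entry returned shows the value found next to the value expected.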

We also set a custom Windows virtual memory size on the DataStage project and installation drive:
[Initial: 16384 MB (16 GB) and Maximum: 24576 MB (24 GB)]

After all of these changes, the errors above became rare, but performance took a hit. Batches that used to take 3-4 hrs to complete were taking 9-10 hrs. After some trial and error and troubleshooting, we observed that performance degrades in a consistent pattern:

1. Cleanup of &PH& and other temp directories, then the batch is executed: no change in performance.
2. Cleanup of &PH& and other temp directories followed by a server restart: the batch completes within the expected time.
3. Batch executed without any cleanup or server restart: execution time increases with each cycle.
Overall, the batch completes within the expected time frame if the server is restarted between cycles. Without a restart, total run time increases gradually with every run.
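
To put a number on this trend, the elapsed time of each cycle can be logged together with whether a restart preceded it, and the growth per cycle computed for each stretch between restarts. The sketch below is one way to do that; the figures in `runs` are placeholders for illustration, not our measurements, and the function names are made up for this example.

```python
def growth_per_cycle(durations_hours):
    """Least-squares slope of run time against cycle index (hours per cycle)."""
    n = len(durations_hours)
    if n < 2:
        raise ValueError("need at least two cycles")
    mean_x = (n - 1) / 2
    mean_y = sum(durations_hours) / n
    num = sum((i - mean_x) * (y - mean_y)
              for i, y in enumerate(durations_hours))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def split_on_restart(runs):
    """Group (hours, restarted_before) records into stretches between restarts."""
    stretches, current = [], []
    for hours, restarted in runs:
        if restarted and current:
            stretches.append(current)
            current = []
        current.append(hours)
    if current:
        stretches.append(current)
    return stretches

# Placeholder data: (elapsed hours, was the server restarted before this cycle?)
runs = [(3.5, True), (5.0, False), (6.5, False), (3.5, True), (5.0, False)]

for stretch in split_on_restart(runs):
    if len(stretch) >= 2:
        print(stretch, "growth per cycle (hrs):", growth_per_cycle(stretch))
```

A slope that stays clearly positive within each stretch and drops back after every restart would match the pattern described above.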

We have checked resource usage in Task Manager and have also extracted nmon statistics. CPU usage stayed high during every cycle; other resource usage was normal. The DataStage server has 24 GB RAM and 4 cores.

Apart from that, we also hit environmental issues at random during batch execution. DataStage suddenly goes into a not-responding state, both in Designer and in the run-time environment. If we try to open a stage / connector in Designer at that moment, it takes a long time on all clients, and job execution gets stuck as well. After 5-10 minutes, the stages on all clients open at once. Jobs that are running during that window often abort, frequently without a proper error message or with only a time-out error.

We have also raised a PMR for this but have yet to receive a workable solution.

I have checked the forums for similar issues. Either we have already applied the suggested fixes, or the scenarios differ enough that the posts do not map to our problem.

Please help.

Best Regards,
Abhishek