Lengthy startup time for 90% of DataStage jobs in 11.5.0.2


asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Do you have the new Work Load Manager turned off?
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Lengthy start up time for 90% of datastage jobs in 11.5.

Post by chulett »

It's a bit buried in the original post...
arvind_ds wrote:Work load management(WLM) is disabled in our environment.
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Post by JRodriguez »

Hi Arvind_ds,
Wondering if you managed to resolve the issue? A lot of us could learn from this situation...

We have a similar environment with 11.5.0.1 and Oracle 11g for XMETA, and we are planning to apply Fix Pack 2 and migrate the schemas to Oracle 12c soon, just to provide the new Governance Catalog features to our governance people. So we are trying to find out what the root cause of your issues was, and whether it was related to Oracle 12c and/or Fix Pack 2.

Please let us know
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Problem still NOT resolved. I will make sure to update this post once we get a permanent solution.

Sev 1 PMR in place. We are following up closely with IBM Customer Support and have exchanged a lot of log files with them over the last 2 weeks.

They suggested reconfiguring the disks on the AIX server where the DS engine is installed, to make it similar to what we had in the old 9.1.2 environment.

We have done the disk re-configuration and the situation has improved slightly (a 20% gain in performance).

Customer Support is now suggesting that we increase the CPUs on the DataStage engine by 50%. This is in progress.

Will keep you all posted.
Arvind
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... as a scope check, is just startup time the issue? Meaning, once a job actually gets going, disk access isn't an issue? I only ask because the one time in the past when we had similar issues that required 'reconfiguring the file system' we were using, disk access was crap all around. And everything sprang back to life once the file system settings were set correctly.
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Adding more CPU is nuts and the IBM guy who is recommending that is a newb. Not to mention you've just signed up for 50% more licensing cost because of a product shortcoming. This is not a LOAD based issue. I bet if you nmon your box and look at the CPU load during your slowness you'll prove that.

Thumbs down on the increase in your cores.
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Ok PaulVL. So do you think that it's a product shortcoming? Kindly share some details if you have also observed a similar issue in the upgraded version of InfoSphere Information Server, 11.5.0.2.

We are literally struggling. Appreciate your inputs.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, based upon your described symptoms, you slow down over time. You did not indicate that there were a lot of jobs running concurrently at the time. That implies that your CPU should not be pushing its limits.

Run NMON on your box and capture the stats every X amount of minutes. "Reset" your environment to make it fast again... then let it slow down over time. Afterwards you look at the CPU consumption of the box during that timeframe and determine if you need more cores.

I suspect that you will not need more cores.
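
One way to check this from a recorded capture is sketched below. It assumes the capture was taken in nmon's file-recording mode and that each `CPU_ALL` sample line has the layout `CPU_ALL,T<nnnn>,User%,Sys%,Wait%,Idle%,...`; please verify that against your own file before trusting the numbers. The function name and the 90% threshold are illustrative choices, not something nmon or DataStage defines.

```python
from statistics import mean


def summarize_cpu(lines, busy_threshold=90.0):
    """Summarize CPU_ALL samples from an nmon capture (column layout is assumed)."""
    samples = []
    for line in lines:
        fields = line.strip().split(",")
        # Keep only CPU_ALL sample rows; the header row has no T<nnnn> snapshot id.
        if len(fields) < 6 or fields[0] != "CPU_ALL":
            continue
        if not fields[1].startswith("T") or not fields[1][1:].isdigit():
            continue
        user, system, wait, _idle = (float(x) for x in fields[2:6])
        samples.append((user + system, wait))
    if not samples:
        return None
    busy = [b for b, _ in samples]
    return {
        "samples": len(samples),
        "avg_busy_pct": mean(busy),
        "avg_wait_pct": mean(w for _, w in samples),
        # Fraction of samples where user+sys CPU was at or above the threshold.
        "share_over_threshold": sum(b >= busy_threshold for b in busy) / len(busy),
    }


# Made-up sample rows, only to show the call.
example = [
    "CPU_ALL,CPU Total host,User%,Sys%,Wait%,Idle%",
    "CPU_ALL,T0001,20.0,10.0,40.0,30.0",
    "CPU_ALL,T0002,30.0,10.0,50.0,10.0",
    "MEM,T0001,1,2,3,4",
]
print(summarize_cpu(example))
```

If the busy share stays low during the slow window while the wait percentage climbs, that would argue against adding cores and point back at I/O.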


One thing that would be helpful is to ensure we are talking about a common view of what you are describing as slow startup time.

Please detail that interpretation, and be descriptive.

- did the DSD.RUN start?
- Did the osh start up?
- What is the log saying?
- Any database connections linked yet?
- etc...
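
To make those checkpoints comparable between runs, the elapsed time of each startup phase can be tabulated. The sketch below assumes the timestamps have already been read off the job log by hand; the phase names and times are hypothetical placeholders, not actual DataStage log output.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def phase_durations(events):
    """Return (phase, seconds since the previous event) for time-ordered events."""
    ordered = sorted(events, key=lambda e: datetime.strptime(e[1], TIME_FORMAT))
    result = []
    for (_, prev), (name, cur) in zip(ordered, ordered[1:]):
        delta = datetime.strptime(cur, TIME_FORMAT) - datetime.strptime(prev, TIME_FORMAT)
        result.append((name, delta.total_seconds()))
    return result


# Hypothetical timestamps for one job run (same day, HH:MM:SS).
events = [
    ("job requested", "10:00:00"),
    ("DSD.RUN started", "10:00:05"),
    ("osh started", "10:03:05"),
    ("first database connection", "10:03:20"),
]
for name, seconds in phase_durations(events):
    print(f"{name}: {seconds:.0f}s")
```

Whichever phase dominates tells you where "slow startup" is actually being spent.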
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You might also consider enabling reporting environment variables such as APT_STARTUP_STATUS and APT_PM_PLAYER_TIMING to capture some figures about how long things are taking and resource consumption.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Thank you all for your valuable inputs. We tried setting the variables below at project level, as advised by the PMR support engineers.

APT_DEBUG_CLEANUP=1
APT_NO_JOBMON=1
APT_SHOW_COMPONENT_CALLS=1
APT_PM_PLAYER_TIMING=1
APT_NO_PM_SIGNAL_HANDLERS=1
CORE_NAMING=true
APT_DUMP_SCORE=true
APT_PM_SHOW_PIDS=true
APT_STARTUP_STATUS=true
CC_MSG_LEVEL=2
APT_DISABLE_COMBINATION=true

The problem is still not completely resolved. After running multiple tests and sharing the log files (Director logs and stack trace logs) with PMR support, the issue is pointing more towards the disk configuration across all file systems used in the Engine tier.

The first round of disk re-configuration is complete and we have observed a 30 to 40% improvement in job performance. Jobs no longer hang after the disk re-configuration, BUT they are still slow compared to the 9.1.2 environment. Both environments are now identical with respect to capacity.

We are aiming to fine-tune the disk configuration further by setting up dedicated (non-shared) HDD disks for each of the following file systems on the engine tier:

(1) Scratch : Scratch file system
(2) DataSets : datasets file system
(3) TMPDIR : File system corresponding to TMPDIR variable
(4) Project Plus : This one is used to store application specific data files and scripts.
(5) Project : DS projects are created under this file system
(6) /opt/IBM/InformationServer : The DataStage binaries are on this one.

At present, some of the file systems use shared disks underneath. For example, the datasets and scratch file systems share the same set of HDD disks (~10 disks of 500 GB each). Similarly, the Project Plus, Project and TMPDIR file systems share another set of disks (different from the scratch and datasets disks).
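
The sharing described above can be written down as a mapping from file system to physical disks, which makes the contention candidates easy to list. This is only a minimal sketch: the `hdisk` names are made up, and just the grouping (scratch with datasets; Project Plus, Project and TMPDIR together) comes from our current setup.

```python
from itertools import combinations


def shared_disk_pairs(layout):
    """Return {(fs_a, fs_b): shared disks} for each pair of file systems with a disk in common."""
    pairs = {}
    for (fs_a, disks_a), (fs_b, disks_b) in combinations(sorted(layout.items()), 2):
        common = set(disks_a) & set(disks_b)
        if common:
            pairs[(fs_a, fs_b)] = sorted(common)
    return pairs


# Disk names are hypothetical; only the grouping reflects the description above.
layout = {
    "Scratch": {"hdisk1", "hdisk2"},
    "DataSets": {"hdisk1", "hdisk2"},
    "TMPDIR": {"hdisk3"},
    "Project Plus": {"hdisk3"},
    "Project": {"hdisk3"},
}
for pair, disks in shared_disk_pairs(layout).items():
    print(pair, disks)
```

An empty result would correspond to the target state, where no two file systems sit on the same disks.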

In addition, we are planning to replace the HDDs with SSDs.

Will keep you posted.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

I still don't think that is the issue. You said your slowness happens over time.

Are you sure that at the time of slowness you don't have multiple jobs running?

Meaning... once you go slow... if you only had one job running... would it still be slow?
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

If we leave the system alone at the time of slowness, the jobs run to completion. The only issue is that a job which is supposed to finish in, e.g., 30 minutes (the baseline 9.1.2 run of the same job against the same data volume) takes 10X more time in 11.5.

Yes, at the time of slowness multiple jobs are running in parallel, and the slowness develops over time. After the disk re-configuration the slowness is still there, BUT it has reduced from 10X to 2X with respect to run time.
No jobs abort when the system experiences slowness; they just take more time to finish.
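
Put in minutes, using the 30-minute baseline job from above as the example (the function name is just for illustration):

```python
def slowed_runtime(baseline_minutes, factor):
    """Elapsed minutes when a job runs `factor` times slower than its baseline."""
    return baseline_minutes * factor


baseline = 30  # minutes, the 9.1.2 example run mentioned above
before = slowed_runtime(baseline, 10)  # before the disk re-configuration
after = slowed_runtime(baseline, 2)    # after the disk re-configuration
print(before, after)  # 300 60
```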

Another thing is that whenever the system goes slow, the jobs still appear in RUNNING state, but to end users the system looks HUNG because the job monitor shows no progress for a long time.

During the slowness, the jobs finish after taking longer, and when only 1 or 2 jobs are left in the batch they complete nicely. These last 1 or 2 jobs in the batch don't experience any slowness.

We tried browsing through IGC at the time of slowness and queried XMETA with all possible options in IGC; no slowness was observed there. We also ran the ISALite general health check at the time of slowness, and it finished fine within 10 minutes. No issues were reported in the ISALite report either.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Are there a lot of tourists in the environment at the time of slowness? (dsapi slaves from your ops folks all seeking to eyeball the slow jobs)
attu
Participant
Posts: 225
Joined: Sat Oct 23, 2004 8:45 pm
Location: Texas

Post by attu »

How many instances of a job are you running? Have you captured the I/O when the problem occurs?


Thanks
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

The tourist count at the time of slowness is ~15, and the total number of jobs running in parallel at that time is around 15 to 20, with 25% of the jobs being multiple instance jobs.
Arvind