DSXchange: DataStage and IBM Websphere Data Integration Forum
This topic has been marked "Resolved."
PaulVL
Group memberships: Premium Members
Joined: 17 Dec 2010
Posts: 1167
Points: 7709

Posted: Mon Jul 31, 2017 8:42 am

Ask your AIX admin and the networking guy how the LPAR internal LAN is set up between the LPAR image and the actual physical network card.

You might be on a 10 Gb network going out of the card, but between your host image and the card sits a 1 Gb internal I-LAN configuration. This is typically done because of standards. Since the one (or two) network cards have to service multiple LPARs, the system admins often configure the internal communication as a 1 Gb connection.
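
A rough back-of-the-envelope sketch of why that internal link matters. The 100 GB payload is a hypothetical figure, and the calculation ignores protocol overhead and contention from other LPARs, so treat the numbers as a best case rather than a measurement:

```python
def transfer_seconds(size_gigabytes: float, link_gigabits_per_sec: float) -> float:
    """Ideal time to move a payload over a link, ignoring protocol overhead."""
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / link_gigabits_per_sec


payload_gb = 100  # hypothetical payload size, not taken from the thread
for link in (1, 10):
    seconds = transfer_seconds(payload_gb, link)
    print(f"{link:>2} Gb/s link: {seconds:.0f} s")
```

If the slowest hop is the 1 Gb internal link, that hop sets the effective rate no matter how fast the physical card is.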

asorrell
Site Admin
Group memberships: Premium Members, DSXchange Team, Inner Circle, Server to Parallel Transition Group
Joined: 04 Apr 2003
Posts: 1637
Location: Colleyville, Texas
Points: 22256

Posted: Tue Aug 01, 2017 12:12 pm

Do you have the new Work Load Manager turned off?

_________________
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2017

chulett
Premium Poster since January 2006
Group memberships: Premium Members, Inner Circle, Server to Parallel Transition Group
Joined: 12 Nov 2002
Posts: 42273
Location: Denver, CO
Points: 217068

Posted: Tue Aug 01, 2017 12:15 pm

It's a bit buried in the original post...

arvind_ds wrote:
Work load management(WLM) is disabled in our environment.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow

JRodriguez
Group memberships: Premium Members
Joined: 19 Nov 2005
Posts: 400
Location: New York City
Points: 4376

Posted: Thu Aug 10, 2017 10:03 am

Hi Arvind_ds,
Wondering if you got to resolve the issue? A lot of us could learn from this situation...

We have a similar environment with 11.5.0.1 and Oracle 11g for XMETA. We are planning to apply Fix Pack 2 and migrate the schemas to Oracle 12c soon, just to provide the new Governance Catalog features to our governance people. So we are trying to find out what the root cause of your issues was, and whether it was related to Oracle 12c and/or Fix Pack 2.

Please let us know.

_________________
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses"

arvind_ds
Participant
Joined: 16 Aug 2007
Posts: 428
Location: Manali
Points: 4200

Posted: Sat Aug 12, 2017 3:23 am

Problem still NOT resolved. I will make sure to update this post once we get a permanent solution.

A Sev 1 PMR is in place. We are following up closely with IBM Customer Support and have exchanged a lot of log files with them over the last 2 weeks.

They suggested reconfiguring the disks on the AIX server where the DS engine is installed, to make the layout similar to what we had in the old 9.1.2 environment.

We have done the disk re-configuration and the situation has improved slightly (20% gain in performance).

Now Customer Support is suggesting that we increase the CPUs on the DataStage engine by 50%. This is in progress.

Will keep you all posted.

_________________
Arvind

chulett

Posted: Sat Aug 12, 2017 1:56 pm

So... as a scope check, is just startup time the issue? Meaning, once a job actually gets going, disk access isn't an issue? I only ask because the one time in the past when we had similar issues, which required 'reconfiguring the file system' we were using, disk access was crap all around. And everything sprang back to life once the file system settings were set correctly.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow

PaulVL

Posted: Mon Aug 14, 2017 7:52 am

Adding more CPU is nuts and the IBM guy who is recommending that is a newb. Not to mention you've just signed up for 50% more licensing cost because of a product shortcoming. This is not a LOAD-based issue. I bet if you nmon your box and look at the CPU load during your slowness you'll prove that.

Thumbs down on the increase in your cores.

arvind_ds

Posted: Mon Aug 14, 2017 12:49 pm

Ok PaulVL. So do you think that it's a product shortcoming? Kindly share some details if you have also observed a similar issue in the upgraded version, InfoSphere Information Server 11.5.0.2.

We are literally struggling. Appreciate your inputs.

_________________
Arvind

PaulVL

Posted: Mon Aug 14, 2017 2:27 pm

Well, based upon your described symptoms, you slow down over time. You did not indicate that there were a lot of jobs running concurrently at the time. That implies that your CPU should not be pushing its limits.

Run NMON on your box and capture the stats every X amount of minutes. "Reset" your environment to make it fast again... then let it slow down over time. Afterwards you look at the CPU consumption of the box during that timeframe and determine if you need more cores.

I suspect that you will not need more cores.
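
A minimal sketch of that after-the-fact check, assuming nmon was run in its file-recording mode and that the capture contains `CPU_ALL` rows laid out as `CPU_ALL,Tnnnn,User%,Sys%,Wait%,Idle%,...`. That column layout, the sample rows, the host name and the 90% threshold are all assumptions for illustration; check them against your own capture file before trusting the result:

```python
import csv
import io


def cpu_busy_samples(nmon_text: str) -> list[float]:
    """Return User% + Sys% for each CPU_ALL data row (assumed column layout)."""
    busy = []
    for row in csv.reader(io.StringIO(nmon_text)):
        # Data rows carry a snapshot label such as T0001 in the second column;
        # the header row carries a text label there instead and is skipped.
        is_data = (
            len(row) >= 4
            and row[0] == "CPU_ALL"
            and row[1][:1] == "T"
            and row[1][1:].isdigit()
        )
        if is_data:
            busy.append(float(row[2]) + float(row[3]))
    return busy


def looks_cpu_bound(busy: list[float], threshold: float = 90.0,
                    fraction: float = 0.5) -> bool:
    """True if at least `fraction` of samples are at or above `threshold` % busy."""
    if not busy:
        return False
    hot = sum(1 for value in busy if value >= threshold)
    return hot / len(busy) >= fraction


# Made-up sample rows, only to show the shape of the input.
sample = """CPU_ALL,CPU Total myhost,User%,Sys%,Wait%,Idle%,Busy,PhysicalCPUs
CPU_ALL,T0001,20.0,5.0,40.0,35.0,,8
CPU_ALL,T0002,22.5,6.5,45.0,26.0,,8
"""
busy = cpu_busy_samples(sample)
print(busy, looks_cpu_bound(busy))
```

In the made-up sample the box is mostly waiting rather than busy, which is the pattern that would argue against adding cores.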


One thing that would be helpful is to ensure we are talking about a common view of what you are describing as slow startup time.

Please detail that interpretation, and be descriptive.

- Did the DSD.RUN start?
- Did the osh start up?
- What is the log saying?
- Any database connections linked yet?
- etc...

ray.wurlod
Premium Poster
Participant
Group memberships: Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group
Joined: 23 Oct 2002
Posts: 54071
Location: Sydney, Australia
Points: 293279

Posted: Thu Aug 17, 2017 12:35 am

You might also consider enabling reporting environment variables such as APT_STARTUP_STATUS and APT_PM_PLAYER_TIMING to capture some figures about how long things are taking and resource consumption.

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne

arvind_ds

Posted: Thu Aug 17, 2017 12:49 pm

Thank you all for your valuable inputs. We tried setting the variables below at project level, as advised by the PMR support engineers.

APT_DEBUG_CLEANUP=1
APT_NO_JOBMON=1
APT_SHOW_COMPONENT_CALLS=1
APT_PM_PLAYER_TIMING=1
APT_NO_PM_SIGNAL_HANDLERS=1
CORE_NAMING=true
APT_DUMP_SCORE=true
APT_PM_SHOW_PIDS=true
APT_STARTUP_STATUS=true
CC_MSG_LEVEL=2
APT_DISABLE_COMBINATION=true
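
A small sketch for confirming which of these variables are actually visible to a given process. It only inspects the environment of the process it runs in; whether and how project-level settings reach that process is an assumption this sketch does not verify:

```python
import os

# Variable names copied from the list above.
REPORTING_VARS = [
    "APT_DEBUG_CLEANUP",
    "APT_NO_JOBMON",
    "APT_SHOW_COMPONENT_CALLS",
    "APT_PM_PLAYER_TIMING",
    "APT_NO_PM_SIGNAL_HANDLERS",
    "CORE_NAMING",
    "APT_DUMP_SCORE",
    "APT_PM_SHOW_PIDS",
    "APT_STARTUP_STATUS",
    "CC_MSG_LEVEL",
    "APT_DISABLE_COMBINATION",
]


def report_settings(env=None):
    """Map each variable name to its value, or None when it is not set."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in REPORTING_VARS}


for name, value in report_settings().items():
    print(f"{name}={value if value is not None else '<not set>'}")
```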

The problem is still not completely resolved. After doing multiple tests and sharing the log files (Director logs and stack trace logs) with PMR support, the issue is pointing more towards the disk configuration across all file systems used in the Engine tier.

The first round of disk re-configuration is complete and we have observed a 30 to 40% improvement in job performance. Jobs no longer hang after the disk re-configuration, BUT they are still slow compared to the 9.1.2 environment. Both environments are now identical with respect to capacity.

We are targeting further fine tuning of the disk configuration. We are aiming to set up non-shared HDD disks for each of the file systems on the engine tier (listed below).

(1) Scratch : Scratch file system
(2) DataSets : datasets file system
(3) TMPDIR : File system corresponding to TMPDIR variable
(4) Project Plus : This one is used to store application specific data files and scripts.
(5) Project : DS projects are created under this file system
(6) /opt/IBM/InformationServer : Datastage binaries on this one.

At present, some of the file systems are using shared disks underneath. E.g. the datasets and scratch file systems share the same set of HDD disks (~10 disks of 500 GB each). Similarly, the Project Plus, Project and TMPDIR file systems share another set of disks (different from the scratch and datasets disks).
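
A quick sketch for checking which directories sit on the same file system, by grouping them on the device ID that `os.stat` reports. The mount points below are hypothetical placeholders. Note the limitation: this identifies the file system (logical volume), not the physical disks underneath it, so two separate file systems carved from the same set of disks will still show up as different devices. Mapping logical volumes to physical disks needs the AIX LVM tools.

```python
import os
from collections import defaultdict


def group_by_device(paths):
    """Group paths by the device ID of the file system they live on."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_dev].append(path)
    return dict(groups)


# Hypothetical mount points; substitute the real ones for your engine tier.
candidates = ["/scratch", "/datasets", "/tmp"]
existing = [path for path in candidates if os.path.exists(path)]
for device, members in group_by_device(existing).items():
    print(device, members)
```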

In addition, we are also targeting to replace HDD with SSD.

Will keep you posted.

_________________
Arvind

PaulVL

Posted: Thu Aug 17, 2017 3:42 pm

I still don't think that is the issue. You said your slowness happens over time.

Are you sure that at the time of slowness you don't have multiple jobs running?

Meaning... once you go slow... if you only had one job running... would it still be slow?
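
One way to run that single-job check is to time the same launch command twice, once when the environment is fresh and once after it has slowed down, with nothing else running. A minimal sketch follows. The command used here is only a placeholder that starts a Python interpreter and exits; substitute whatever launches the job at your site.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run a command and return (elapsed wall-clock seconds, return code)."""
    start = time.monotonic()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start, result.returncode


# Placeholder command (an assumption): replace with the real job launch.
cmd = [sys.executable, "-c", "pass"]
elapsed, return_code = time_command(cmd)
print(f"elapsed={elapsed:.2f}s rc={return_code}")
```

If the lone job is just as slow in the degraded state, concurrency is not the cause; if it runs at normal speed, contention between jobs becomes the more likely suspect.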

arvind_ds

Posted: Fri Aug 18, 2017 6:45 am

If we leave the system alone at the time of slowness, the jobs do run to completion. The only issue is that a job that is supposed to finish in, e.g., 30 minutes (baseline 9.1.2 run of the same job against the same data volume) takes 10X more time in 11.5.

Yes, at the time of slowness multiple jobs are running in parallel, and the slowness builds up over time. After the disk re-configuration the slowness is still there, BUT it has reduced from 10X to 2X in terms of run time.
No job aborts when the system experiences slowness; it just takes more time to finish.

Another thing: whenever it goes slow, the jobs still appear in RUNNING state, but to end users the system appears to be HUNG because the job monitor does not show any progress for a long time.

During the slowness, the jobs finish after taking longer than usual, and when only 1 or 2 jobs are left in the batch they complete nicely. These last 1 or 2 jobs in the batch don't experience any slowness.

We tried browsing through IGC at the time of slowness and queried XMETA with all possible options in IGC; no slowness was observed there. We also executed the ISALite general health check (at the time of slowness), and it finished fine within 10 minutes. No issues were reported in the ISALite report either.

_________________
Arvind

PaulVL

Posted: Fri Aug 18, 2017 8:05 am

Are there a lot of tourists in the environment at the time of slowness? (dsapi slaves from your ops folks all seeking to eyeball the slow jobs)

attu
Participant
Joined: 23 Oct 2004
Posts: 223
Location: Texas
Points: 1745

Posted: Fri Aug 18, 2017 9:56 pm

How many instances of a job are you running? Have you captured the I/O when the problem occurs?


Thanks
All times are GMT - 6 Hours