DSXchange: DataStage and IBM Websphere Data Integration Forum
This topic has been marked "Resolved."
arvind_ds (Participant)
Posted: Fri Aug 18, 2017 11:21 pm

The number of users connected through Director clients at the time of slowness is ~15, and the total number of jobs running in parallel at that time is around 15 to 20, with about 25% of the jobs being multi-instance jobs.

_________________
Arvind
PaulVL (Premium Member)
Posted: Mon Aug 21, 2017 9:56 am

That's a lot of eyeballs.

With everyone updating by default every 5 seconds... that's a lot of traffic.

As a test, have most of those operations folks log out and just 1 admin eyeball a slow job to see if it speeds up.
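
If it helps to quantify the "eyeballs" first, one rough way is to count established client connections (Designer/Director/Administrator) on the engine tier. This is only a sketch and assumes dsrpcd is on its default port 31538; check /etc/services if your install uses a different one.

Code:
# Count established DataStage client connections on the engine tier.
# Assumption: dsrpcd listens on the default port 31538.
netstat -an | grep 31538 | grep -c ESTABLISHED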
qt_ky (Premium Member)
Posted: Tue Sep 19, 2017 8:31 am

Did this get resolved?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds (Participant)
Posted: Wed Oct 11, 2017 3:49 pm

Not yet. Disk re-configuration has been done. Separate volume groups were created for all major file systems, with each volume group (VG) allocated non-shareable SSD disks. All file systems are mounted over SAN.

e.g.

VG1 : /opt/IBM/InformationServer (disk 1, disk 2)
VG2 : /ETL/Datasets (disk 3, disk 4, disk 5, disk 6)
VG3 : /ETL/Scratch (disk 7, disk 8, disk 9, disk 10)
VG4 : /ETL/Projects (disk 11, disk 12)
VG5 : /ETL/UVTEMP (disk 13, disk 14)
VG6 : /ETL/data_files (disk 15, disk 16)

No clue about what to do next.
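
For reference, a minimal sketch of how one of the volume groups above could be built on AIX; the disk names, file system size and options here are illustrative assumptions, not the actual build commands used.

Code:
# Illustrative only: hdisk names and size are assumptions.
mkvg -y VG2 hdisk3 hdisk4 hdisk5 hdisk6                    # volume group on four SSD LUNs
crfs -v jfs2 -g VG2 -m /ETL/Datasets -a size=500G -A yes   # JFS2 file system in that VG
mount /ETL/Datasets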

_________________
Arvind
chulett (Premium Poster, since January 2006)
Posted: Wed Oct 11, 2017 7:08 pm

Okay... and? You've detailed the "re-configuration" you've done but made no mention of how that affected your issue. Where do things stand now?

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
arvind_ds (Participant)
Posted: Wed Oct 11, 2017 11:27 pm

We had raised a PMR earlier, which is still open. After we added multiple debug variables at the project level, ran the jobs, and shared the collected log files with IBM support, the PMR engineers concluded from those debug logs that disk contention was causing the slowness in 11.5.0.2.

We have since re-configured the disks and replaced the HDDs with SSDs, but the performance issue is still there.

Jobs that used to take X amount of time in version 9.1.2 are now taking on average 5X to 6X that time in version 11.5.0.2 on the AIX 7.x platform.

What is the role of the pd_npages AIX parameter with respect to the performance of DataStage jobs?

We noticed the following values for this parameter:

(1) In version 9.1.2/AIX 7.1 TL2, the value of pd_npages on engine tier = 65536

(2) In version 11.5.0.2/AIX 7.1 TL4, the value of pd_npages on engine tier = 4096

The rest of the AIX parameters are the same across both engine tier servers (9.1.2 and 11.5.0.2).
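
For anyone wanting to compare the two servers or trial the old value, a minimal sketch using the AIX ioo command (run as root on the engine tier):

Code:
# Show the current value of the tunable
ioo -a | grep pd_npages
# Trial change back to the 9.1.2 value; -p also records it in /etc/tunables/nextboot
ioo -p -o pd_npages=65536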

To conclude, the disk re-configuration has not resolved the issue so far.

Please share your inputs.

_________________
Arvind
chulett (Premium Poster, since January 2006)
Posted: Thu Oct 12, 2017 6:53 am

File system buffer tuning. Now, how those affect "performance of DataStage jobs" I have no idea.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
qt_ky (Premium Member)
Posted: Thu Oct 12, 2017 7:34 am

As indicated by Craig's Knowledge Center link, lowering the pd_npages setting can help performance related to real-time applications that delete files. Here is what command help says about the same setting:

Code:
# ioo -h pd_npages
Help for tunable pd_npages:
Purpose:
Specifies the number of pages that should be deleted in one chunk from RAM when a file is deleted.
Values:
        Default: 4096
        Range: 1 - 524288
        Type: Dynamic
        Unit: 4KB pages
Tuning:
The maximum value indicates the largest file size, in pages. Real-time applications that experience sluggish response time while files are being deleted. Tuning this option is only useful for real-time applications. If real-time response is critical, adjusting this option may improve response time by spreading the removal of file pages from RAM more evenly over a workload.


All of our systems (v11.3.x and v11.5.0.2+SP2) are running with the default value 4096. We run a mix of mostly DataStage jobs along with a small number of real-time DataStage jobs using ISD, none of which involve deleting files. I would not know how much difference this particular setting might make without changing it, rebooting, and testing a typical workload. We have never had to tweak this setting.

Here are a number of other ideas:

Compare the lparstat -i output across servers. Does it look as expected with memory and CPU allocations? Can you share the output (minus the server names)?
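
For example, something along these lines could be run on both engine tiers and diffed; the file names are just placeholders.

Code:
# Capture the LPAR configuration on each server for comparison
lparstat -i > /tmp/lparstat_$(hostname).txt
# After copying one file across, compare the two
diff /tmp/lparstat_old.txt /tmp/lparstat_new.txt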

What are your LPAR priority values set to (Desired Variable Capacity Weight) and do they match across LPARs?

Are you sharing a physical server where any of the other LPARs might be overloaded or allowed to use the default shared CPU pool (all of the cores)?

Could you put a workload on that doesn't involve any disk I/O and do some comparisons across servers? Something like a row generator stage, set to run in parallel (default is sequential), to a transformer that does some mathematical functions... If that runs well, then run another test job that does local disk I/O but does not touch Oracle, and so on.
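
A minimal sketch of timing such a test job from the command line with dsjob, assuming the default engine install path; the project and job names are placeholders.

Code:
# Time a CPU-only test job (row generator -> transformer) on each server.
# TESTPROJ and TestCpuOnly are placeholder names.
. /opt/IBM/InformationServer/Server/DSEngine/dsenv
time $DSHOME/bin/dsjob -run -jobstatus TESTPROJ TestCpuOnly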

For what it may be worth, here is how we have /etc/security/limits set, which is a bit different than yours:

Code:
default:
        stack_hard = -1
        fsize = -1
        core = -1
        cpu = -1
        data = -1
        rss = -1
        stack = -1
        nofiles = -1
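
To compare against that, the effective limits for the engine user can be listed directly; "dsadm" is assumed here to be the DataStage administrator account.

Code:
# List the resource limits the engine user actually gets (-1 means unlimited)
lsuser -a fsize cpu data rss stack nofiles core dsadm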


It may not be practical but have you considered a workaround, such as a daily reboot, at least on the engine tier?

Is AIX process accounting active or enabled? We found that everything ran better when it was disabled.
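
One way to check, as a sketch (standard AIX accounting locations assumed):

Code:
# If process accounting is on, /var/adm/pacct exists and keeps growing
ls -l /var/adm/pacct
# Turn it off; also remove any acct entries from adm's crontab to keep it off
/usr/sbin/acct/shutacct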

Since you have reconfigured your disks at least once or twice, have you gone back and changed GLTABSZ from default 75 to 300 and run the uvregen? Several tech notes suggest increasing RLTABSZ/GLTABSZ/MAXRLOCK values to 300/300/299, especially when multi-instance jobs are used.
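
For what it is worth, the usual sequence for that change looks roughly like this; the engine path shown is the default install location and may differ on your system.

Code:
cd /opt/IBM/InformationServer/Server/DSEngine    # $DSHOME; assumed default path
bin/uv -admin -stop                              # stop the engine with no jobs running
vi uvconfig                                      # set RLTABSZ 300, GLTABSZ 300, MAXRLOCK 299
bin/uvregen                                      # regenerate the engine configuration
bin/uv -admin -start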

Just to confirm, is the performance problem limited to parallel job startup time only, and not the actual run time after startup, and not affecting sequence jobs or server jobs?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds (Participant)
Posted: Thu Nov 09, 2017 9:47 am

Thank you all. The issue has been resolved successfully. We followed the steps below, as suggested by IBM support; a command-level sketch follows the list. What a relief!

(1) Enable release-behind (RBRW) on the dataset and scratch file systems and re-mount them on the AIX server. This lets the system return pages to the free list as soon as the application has read or written them, rather than keeping them in memory (the default behavior).
(2) Use the "noatime" option on the same file systems. With this option the system no longer updates the file access time (for example, when you run an "ls"), but the modification time is still updated whenever the file data or metadata change.
(3) Change the LIBPATH setting in the dsenv file so that the AIX system library directories are searched first.
(4) Increase the receive buffers for the ent0 virtual I/O Ethernet adapter from 512 to 1024.
(5) Disable disk I/O history.
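
At the file-system and device level, steps (1), (2), (4) and (5) correspond roughly to the commands below; this is a sketch only. The exact receive-buffer attribute name depends on the adapter type, so it is shown as a placeholder, and step (3) is an edit to the dsenv file rather than a command.

Code:
# (1) + (2) release-behind read/write and noatime on the dataset and scratch file systems
chfs -a options=rbrw,noatime /ETL/Datasets
chfs -a options=rbrw,noatime /ETL/Scratch
umount /ETL/Datasets && mount /ETL/Datasets   # re-mount to pick up the new options
umount /ETL/Scratch && mount /ETL/Scratch
# (4) inspect the virtual Ethernet adapter buffers, then raise the relevant one
lsattr -El ent0 | grep buf
chdev -l ent0 -a <receive_buffer_attribute>=1024 -P   # placeholder attribute; -P applies at next reboot
# (5) disable disk I/O history
chdev -l sys0 -a iostat=false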

_________________
Arvind
PaulVL (Premium Member)
Posted: Thu Nov 09, 2017 4:09 pm

Were you by chance running with a Veritas Clustered File System?
arvind_ds (Participant)
Posted: Sat Nov 11, 2017 12:24 am

No, all the file systems were JFS2 and SAN-mounted.

_________________
Arvind