DSXchange: DataStage and IBM Websphere Data Integration Forum
This topic has been marked "Resolved."
arvind_ds (Participant)
Posted: Fri Aug 18, 2017 11:21 pm

The number of users connected through Director clients at the time of slowness is ~15, and the total number of jobs running in parallel at that time is around 15 to 20, with about 25% of the jobs being multi-instance jobs.

_________________
Arvind
PaulVL (Premium Member)
Posted: Mon Aug 21, 2017 9:56 am

That's a lot of eyeballs.

With everyone updating by default every 5 seconds... that's a lot of traffic.

As a test, have most of those operations folks log out and just 1 admin eyeball a slow job to see if it speeds up.
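
If it helps to quantify the "eyeballs" first, one rough way is to count established client connections (Designer/Director/Administrator) on the engine tier. This is only a sketch and assumes dsrpcd is on its default port 31538; check /etc/services if your install uses a different one.

Code:
# Count established DataStage client connections on the engine tier.
# Assumption: dsrpcd listens on the default port 31538.
netstat -an | grep 31538 | grep -c ESTABLISHED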
qt_ky (Premium Member)
Posted: Tue Sep 19, 2017 8:31 am

Did this get resolved?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds (Participant)
Posted: Wed Oct 11, 2017 3:49 pm

Not yet. Disk re-configuration has been done. Separate volume groups were created for all major file systems, with each volume group (VG) allocated non-shareable SSD disks. All file systems are mounted over SAN.

e.g.

VG1 : /opt/IBM/InformationServer (disk 1, disk 2)
VG2 : /ETL/Datasets (disk 3, disk 4, disk 5, disk 6)
VG3 : /ETL/Scratch (disk 7, disk 8, disk 9, disk 10)
VG4 : /ETL/Projects (disk 11, disk 12)
VG5 : /ETL/UVTEMP (disk 13, disk 14)
VG6 : /ETL/data_files (disk 15, disk 16)

No clue about what to do next.
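
For reference, a minimal sketch of how one of the volume groups above could be built on AIX; the disk names, file system size and options here are illustrative assumptions, not the actual build commands used.

Code:
# Illustrative only: hdisk names and size are assumptions.
mkvg -y VG2 hdisk3 hdisk4 hdisk5 hdisk6                    # volume group on four SSD LUNs
crfs -v jfs2 -g VG2 -m /ETL/Datasets -a size=500G -A yes   # JFS2 file system in that VG
mount /ETL/Datasets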

_________________
Arvind
chulett (Premium Poster, since January 2006)
Posted: Wed Oct 11, 2017 7:08 pm

Okay... and? You've detailed the "re-configuration" you've done but made no mention of how that affected your issue. Where do things stand now?

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
arvind_ds (Participant)
Posted: Wed Oct 11, 2017 11:27 pm

We had raised a PMR earlier, which is still open. After we added multiple debug variables at the project level, ran the jobs, and shared the collected log files with IBM support, the PMR engineers concluded from those debug logs that disk contention was causing the slowness in 11.5.0.2.

We have since re-configured the disks and replaced the HDDs with SSDs, but the performance issue is still there.

Jobs that used to take X amount of time in version 9.1.2 are now taking on average 5X to 6X that time in version 11.5.0.2 on the AIX 7.x platform.

What is the role of the pd_npages AIX parameter with respect to the performance of DataStage jobs?

We noticed the following values for this parameter:

(1) In version 9.1.2/AIX 7.1 TL2, the value of pd_npages on engine tier = 65536

(2) In version 11.5.0.2/AIX 7.1 TL4, the value of pd_npages on engine tier = 4096

The rest of the AIX parameters are the same across both engine tier servers (9.1.2 and 11.5.0.2).
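
For anyone wanting to compare the two servers or trial the old value, a minimal sketch using the AIX ioo command (run as root on the engine tier):

Code:
# Show the current value of the tunable
ioo -a | grep pd_npages
# Trial change back to the 9.1.2 value; -p also records it in /etc/tunables/nextboot
ioo -p -o pd_npages=65536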

To conclude, the disk re-configuration has not resolved the issue so far.

Please share your inputs.

_________________
Arvind
chulett (Premium Poster, since January 2006)
Posted: Thu Oct 12, 2017 6:53 am

File system buffer tuning. Now, how those affect "performance of DataStage jobs" I have no idea.

_________________
-craig

Watch out where the huskies go and don't you eat that yellow snow
qt_ky (Premium Member)
Posted: Thu Oct 12, 2017 7:34 am

As indicated by Craig's Knowledge Center link, lowering the pd_npages setting can help performance related to real-time applications that delete files. Here is what command help says about the same setting:

Code:
# ioo -h pd_npages
Help for tunable pd_npages:
Purpose:
Specifies the number of pages that should be deleted in one chunk from RAM when a file is deleted.
Values:
        Default: 4096
        Range: 1 - 524288
        Type: Dynamic
        Unit: 4KB pages
Tuning:
The maximum value indicates the largest file size, in pages. Real-time applications that experience sluggish response time while files are being deleted. Tuning this option is only useful for real-time applications. If real-time response is critical, adjusting this option may improve response time by spreading the removal of file pages from RAM more evenly over a workload.


All of our systems (v11.3.x and v11.5.0.2+SP2) are running with the default value 4096. We run a mix of mostly DataStage jobs along with a small number of real-time DataStage jobs using ISD, none of which involve deleting files. I would not know how much difference this particular setting might make without changing it, rebooting, and testing a typical workload. We have never had to tweak this setting.

Here are a number of other ideas:

Compare the lparstat -i output across servers. Does it look as expected with memory and CPU allocations? Can you share the output (minus the server names)?
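
For example, something along these lines could be run on both engine tiers and diffed; the file names are just placeholders.

Code:
# Capture the LPAR configuration on each server for comparison
lparstat -i > /tmp/lparstat_$(hostname).txt
# After copying one file across, compare the two
diff /tmp/lparstat_old.txt /tmp/lparstat_new.txt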

What are your LPAR priority values set to (Desired Variable Capacity Weight) and do they match across LPARs?

Are you sharing a physical server where any of the other LPARs might be overloaded or allowed to use the default shared CPU pool (all of the cores)?

Could you put a workload on that doesn't involve any disk I/O and do some comparisons across servers? Something like a row generator stage, set to run in parallel (default is sequential), to a transformer that does some mathematical functions... If that runs well, then run another test job that does local disk I/O but does not touch Oracle, and so on.
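
A minimal sketch of timing such a test job from the command line with dsjob, assuming the default engine install path; the project and job names are placeholders.

Code:
# Time a CPU-only test job (row generator -> transformer) on each server.
# TESTPROJ and TestCpuOnly are placeholder names.
. /opt/IBM/InformationServer/Server/DSEngine/dsenv
time $DSHOME/bin/dsjob -run -jobstatus TESTPROJ TestCpuOnly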

For what it may be worth, here is how we have /etc/security/limits set, which is a bit different than yours:

Code:
default:
        stack_hard = -1
        fsize = -1
        core = -1
        cpu = -1
        data = -1
        rss = -1
        stack = -1
        nofiles = -1
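
To compare against that, the effective limits for the engine user can be listed directly; "dsadm" is assumed here to be the DataStage administrator account.

Code:
# List the resource limits the engine user actually gets (-1 means unlimited)
lsuser -a fsize cpu data rss stack nofiles core dsadm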


It may not be practical but have you considered a workaround, such as a daily reboot, at least on the engine tier?

Is AIX process accounting active or enabled? We found that everything ran better when it was disabled.
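
One way to check, as a sketch (standard AIX accounting locations assumed):

Code:
# If process accounting is on, /var/adm/pacct exists and keeps growing
ls -l /var/adm/pacct
# Turn it off; also remove any acct entries from adm's crontab to keep it off
/usr/sbin/acct/shutacct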

Since you have reconfigured your disks at least once or twice, have you gone back and changed GLTABSZ from default 75 to 300 and run the uvregen? Several tech notes suggest increasing RLTABSZ/GLTABSZ/MAXRLOCK values to 300/300/299, especially when multi-instance jobs are used.
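
For what it is worth, the usual sequence for that change looks roughly like this; the engine path shown is the default install location and may differ on your system.

Code:
cd /opt/IBM/InformationServer/Server/DSEngine    # $DSHOME; assumed default path
bin/uv -admin -stop                              # stop the engine with no jobs running
vi uvconfig                                      # set RLTABSZ 300, GLTABSZ 300, MAXRLOCK 299
bin/uvregen                                      # regenerate the engine configuration
bin/uv -admin -start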

Just to confirm, is the performance problem limited to parallel job startup time only, and not the actual run time after startup, and not affecting sequence jobs or server jobs?

_________________
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds (Participant)
Posted: Thu Nov 09, 2017 9:47 am

Thank you all. The issue has been resolved successfully. We followed the steps below, as suggested by IBM support; a command-level sketch follows the list. What a relief!

(1) Enable release-behind (RBRW) on the dataset and scratch file systems and re-mount them on the AIX server. This lets the system return pages to the free list as soon as the application has read or written them, rather than keeping them in memory (the default behavior).
(2) Use the "noatime" option on the same file systems. With this option the system no longer updates the file access time (for example, when you run an "ls"), but the modification time is still updated whenever the file data or metadata change.
(3) Change the LIBPATH setting in the dsenv file so that the AIX system library directories are searched first.
(4) Increase the receive buffers for the ent0 virtual I/O Ethernet adapter from 512 to 1024.
(5) Disable disk I/O history.
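
At the file-system and device level, steps (1), (2), (4) and (5) correspond roughly to the commands below; this is a sketch only. The exact receive-buffer attribute name depends on the adapter type, so it is shown as a placeholder, and step (3) is an edit to the dsenv file rather than a command.

Code:
# (1) + (2) release-behind read/write and noatime on the dataset and scratch file systems
chfs -a options=rbrw,noatime /ETL/Datasets
chfs -a options=rbrw,noatime /ETL/Scratch
umount /ETL/Datasets && mount /ETL/Datasets   # re-mount to pick up the new options
umount /ETL/Scratch && mount /ETL/Scratch
# (4) inspect the virtual Ethernet adapter buffers, then raise the relevant one
lsattr -El ent0 | grep buf
chdev -l ent0 -a <receive_buffer_attribute>=1024 -P   # placeholder attribute; -P applies at next reboot
# (5) disable disk I/O history
chdev -l sys0 -a iostat=false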

_________________
Arvind
PaulVL (Premium Member)
Posted: Thu Nov 09, 2017 4:09 pm

Were you by chance running with a Veritas Clustered File System?
arvind_ds (Participant)
Posted: Sat Nov 11, 2017 12:24 am

No, all the file systems were JFS2 and SAN-mounted.

_________________
Arvind