Lengthy startup time for 90% of DataStage jobs in 11.5.0.2

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

That's a lot of eyeballs.

With everyone updating by default every 5 seconds... that's a lot of traffic.

As a test, have most of those operations folks log out and just 1 admin eyeball a slow job to see if it speeds up.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Did this get resolved?
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Not yet. Disk reconfiguration has been done. Separate volume groups were created for all major file systems, with each volume group (VG) having non-shareable SSD disks allocated. All file systems are mounted over SAN.

For example (a command-level sketch follows this list):

VG1 : /opt/IBM/InformationServer (disk 1, disk 2)
VG2 : /ETL/Datasets (disk 3, disk 4, disk 5, disk 6)
VG3 : /ETL/Scratch (disk 7, disk 8, disk 9, disk 10)
VG4 : /ETL/Projects (disk 11, disk 12)
VG5 : /ETL/UVTEMP (disk 13, disk 14)
VG6 : /ETL/data_files (disk 15, disk 16)
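
For reference, a minimal sketch of how one of these volume groups and its file system could be built on AIX; the hdisk names and size below are placeholders for illustration, not details from this environment:

Code:

mkvg -y VG2 hdisk3 hdisk4 hdisk5 hdisk6                    # volume group for the dataset disks
crfs -v jfs2 -g VG2 -a size=200G -m /ETL/Datasets -A yes   # create a JFS2 file system in that VG
mount /ETL/Datasets
lsvg -l VG2                                                # confirm which LVs/file systems landed where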

No clue about what to do next.
Arvind
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Okay... and? You've detailed the "re-configuration" you've done but made no mention of how that affected your issue. Where do things stand now?
-craig

"You can never have too many knives" -- Logan Nine Fingers
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

We had raised a PMR earlier, which is still open. After we added multiple debug variables at the project level, ran the jobs, and shared the collected log files with IBM support, the PMR engineers concluded from those debug logs that disk contention was causing the slowness in 11.5.0.2.

Now we have done the disk reconfiguration and replaced the HDDs with SSDs as well. The performance issue is still there.

Jobs that used to take X amount of time in version 9.1.2 are now taking 5X to 6X on average in version 11.5.0.2 on the AIX 7.x platform.

What is the role of the pd_npages AIX parameter with respect to the performance of DataStage jobs?

We noticed the following values for this parameter:

(1) In version 9.1.2/AIX 7.1 TL2, the value of pd_npages on engine tier = 65536

(2) In version 11.5.0.2/AIX 7.1 TL4, the value of pd_npages on engine tier = 4096

The rest of the AIX parameters are the same on both engine tier servers (9.1.2 and 11.5.0.2).
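
For reference, the value can be checked on each engine tier with ioo (these are display-only commands and change nothing):

Code:

ioo -o pd_npages          # current value only
ioo -L pd_npages          # current, default and reboot values plus range and type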

To conclude, the disk reconfiguration has not helped us resolve the issue so far.

Please share your inputs.
Arvind
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

File system buffer tuning. Now, how those affect "performance of DataStage jobs" I have no idea.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

As indicated by Craig's Knowledge Center link, lowering the pd_npages setting can help performance related to real-time applications that delete files. Here is what the command help says about the same setting:

Code:

# ioo -h pd_npages
Help for tunable pd_npages:
Purpose:
Specifies the number of pages that should be deleted in one chunk from RAM when a file is deleted.
Values:
        Default: 4096
        Range: 1 - 524288
        Type: Dynamic
        Unit: 4KB pages
Tuning:
The maximum value indicates the largest file size, in pages. Real-time applications that experience sluggish response time while files are being deleted. Tuning this option is only useful for real-time applications. If real-time response is critical, adjusting this option may improve response time by spreading the removal of file pages from RAM more evenly over a workload.

All of our systems (v11.3.x and v11.5.0.2+SP2) are running with the default value 4096. We run a mix of mostly DataStage jobs along with a small number of real-time DataStage jobs using ISD, none of which involve deleting files. I would not know how much difference this particular setting might make without changing it, rebooting, and testing a typical workload. We have never had to tweak this setting.
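
If you did want to experiment with restoring the 9.1.2 value, the help text above lists the tunable as Dynamic, so a reboot should not be strictly required to try a new value; 65536 here is simply the value observed on your old server, not a recommendation:

Code:

ioo -o pd_npages=65536        # change the running value (Type: Dynamic, takes effect immediately)
ioo -p -o pd_npages=65536     # same change, also made persistent across reboots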

Here are a number of other ideas:

Compare the lparstat -i output across servers. Does it look as expected with memory and CPU allocations? Can you share the output (minus the server names)?
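
A quick way to pull out just the fields worth comparing side by side (field names as printed by lparstat -i on recent AIX levels):

Code:

lparstat -i | egrep 'Type|Mode|Entitled Capacity|Online Virtual CPUs|Online Memory|Variable Capacity Weight'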

What are your LPAR priority values set to (Desired Variable Capacity Weight) and do they match across LPARs?

Are you sharing a physical server where any of the other LPARs might be overloaded or allowed to use the default shared CPU pool (all of the cores)?

Could you put a workload on that doesn't involve any disk I/O and do some comparisons across servers? Something like a row generator stage, set to run in parallel (default is sequential), to a transformer that does some mathematical functions... If that runs well, then run another test job that does local disk I/O but does not touch Oracle, and so on.
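
As a complement to those DataStage test jobs (not a replacement for them), a crude OS-level check with dd can help separate raw file system throughput from engine behavior; the path and size here are assumptions, so point it at a file system with enough free space:

Code:

time dd if=/dev/zero of=/ETL/Scratch/ddtest bs=1024k count=4096   # write a 4 GB test file
sync
time dd if=/ETL/Scratch/ddtest of=/dev/null bs=1024k              # read it back
rm /ETL/Scratch/ddtest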

For what it may be worth, here is how we have /etc/security/limits set, which is a bit different than yours:

Code:

default:
        stack_hard = -1
        fsize = -1
        core = -1
        cpu = -1
        data = -1
        rss = -1
        stack = -1
        nofiles = -1
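
To confirm what an engine-user session actually gets (dsadm is an assumed user name here; substitute whatever owns your engine processes):

Code:

lsuser -a fsize core cpu data rss stack nofiles dsadm   # limits as defined for the user
su - dsadm -c "ulimit -a"                               # limits in effect for a new login shell
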
It may not be practical, but have you considered a workaround, such as a daily reboot, at least on the engine tier?

Is AIX process accounting active or enabled? We found that everything ran better when it was disabled.
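
One way to check and, if needed, switch it off on AIX (how accounting gets started varies by site, for example from inittab, cron, or rc scripts, so only the generic commands are shown):

Code:

ls -l /var/adm/pacct      # the accounting file keeps growing while process accounting is on
/usr/sbin/acct/accton     # with no file argument, accton turns process accounting off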

Since you have reconfigured your disks at least once or twice, have you gone back and changed GLTABSZ from the default of 75 to 300 and run uvregen? Several tech notes suggest increasing the RLTABSZ/GLTABSZ/MAXRLOCK values to 300/300/299, especially when multi-instance jobs are used.
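
The commonly documented sequence for that change is roughly the following, run as the engine administrator with no jobs active (paths assume a default engine install; check your own DSHOME first):

Code:

cd `cat /.dshome`          # engine directory, e.g. .../InformationServer/Server/DSEngine
. ./dsenv                  # source the engine environment
vi uvconfig                # set RLTABSZ 300, GLTABSZ 300, MAXRLOCK 299
bin/uv -admin -stop        # stop the engine once all jobs are idle
bin/uvregen                # regenerate the engine configuration from uvconfig
bin/uv -admin -start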

Just to confirm, is the performance problem limited to parallel job startup time only, and not the actual run time after startup, and not affecting sequence jobs or server jobs?
Choose a job you love, and you will never have to work a day in your life. - Confucius
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Thank you all. The issue has been resolved successfully. We followed the steps below, as suggested by IBM support (a command-level sketch follows the list). What a relief!

(1) Enable release-behind (rbrw) on the Datasets and Scratch file systems and re-mount them on the AIX server. This lets the system return pages to the free list as soon as the application has read or written them, rather than keeping them in memory (the default behavior).
(2) Use the "noatime" option on the same file systems. With this, the system no longer updates the file access time (for example, when you run an "ls"), but the modification time is still updated whenever the file data or metadata change.
(3) Change the LIBPATH setting in the dsenv file so that the AIX system library directories are searched first.
(4) Increase the receive buffers for the ent0 virtual I/O Ethernet adapter from 512 to 1024.
(5) Disable disk I/O history.
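
For anyone hitting this later, steps (1), (2) and (5) look roughly like the following at the command level on JFS2/AIX; check any existing mount options before running chfs, and note that steps (3) and (4) depend on your dsenv contents and adapter type, so they are not sketched here:

Code:

chfs -a options=rbrw,noatime /ETL/Datasets     # release-behind read/write plus noatime
chfs -a options=rbrw,noatime /ETL/Scratch
umount /ETL/Datasets; mount /ETL/Datasets      # re-mount so the new options take effect
umount /ETL/Scratch; mount /ETL/Scratch
chdev -l sys0 -a iostat=false                  # step (5): disable disk I/O history
lsattr -El sys0 -a iostat                      # confirm the setting
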
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Were you by chance running with a Veritas Clustered File System?
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

No, all the file systems were JFS2 and SAN mounted.
Arvind