Lengthy startup time for 90% of DataStage jobs in 11.5.0.2

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Lengthy startup time for 90% of DataStage jobs in 11.5.0.2

Post by arvind_ds »

Hi Experts,

We recently migrated from IIS 9.1.2 to 11.5.0.2 on the AIX 7.1 platform, with a non-CDB Oracle 12c metadata repository. We have a 4-tier architecture (one tier each for services, engine and metadata repository, plus the client tier as the fourth), with the services, engine and metadata repository tiers each on a separate AIX LPAR.

We migrated the DS jobs from the old 9.1.2 version to 11.5.0.2 in two releases (R1 and R2). We started with the R1 release and migrated close to 500 DS jobs. The jobs ran fine in the new 11.5.0.2 environment.

Then, after one week, we migrated the second set of jobs (R2), close to 4500 jobs, so a total of 5000 jobs (R1 + R2) were now running on the newly upgraded platform. For the first two days all jobs ran fine within the expected SLA. After two days we observed extreme slowness in almost 90% of the jobs, with lengthy startup times.

During peak load we observed slowness in the overall performance of all running jobs; at a particular point all of the jobs appeared to hang (with a status of running in the Operations Console). At that time the total count of jobs running in parallel was around 60. If we kill one or two jobs at this stage, one or two other jobs (which had been in a running state for the last hour) immediately finish successfully, with a job startup time of, for example, 70 minutes and a production run time of 10 seconds. Workload management (WLM) is disabled in our environment.

Interestingly, if we run the same job separately (no other job running on the engine), the job startup time comes down to around 4 seconds with a production run time of 10 seconds. We have enough capacity on the engine tier: 16 CPUs (P8 series), 256 GB RAM, terabytes of SAN disk space allocated across multiple file systems (for scratch, datasets, staging files etc.), and each project has a dedicated TMPDIR (~20 GB on SAN). No issue with &PH& and /tmp. No disk contention; normal nmon stats for CPU, disk usage, processes, RAM etc.

We raised a PMR with IBM customer support. They suggested adding the parameter below to the dsenv file to handle the lengthy startup time of the DS jobs.

APT_CONNECTION_PORT_RANGE=0

http://www-01.ibm.com/support/docview.w ... wg1JR55721
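
For reference, the change was applied roughly along these lines (a sketch; the engine path is the one from our install, and we bounced the engine afterwards so new jobs pick the variable up):

Code: Select all

cd /opt/IBM/InfoSphere/InformationServer/Server/DSEngine
echo "APT_CONNECTION_PORT_RANGE=0; export APT_CONNECTION_PORT_RANGE" >> dsenv
. ./dsenv               # re-source the engine environment
bin/uv -admin -stop     # stop the DS engine
bin/uv -admin -start    # start it again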

The above change in dsenv didn't help. The PMR follow-up is still going on with no fruitful results so far.

Additionally, we rebooted all tiers (services, engine and xmeta) and re-triggered the jobs. Once again, all jobs ran fine for the next two days and then the same issue resurfaced.

What could be the reason for this issue? Please share your valuable inputs on how to resolve it.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

What is your datastage job log retention policy on the new environment?

Do you have any message handlers set up on the projects?
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Log retention policy is set to auto-purge: last 10 runs.

We have two message handlers defined at project level for two projects (one for each project), #1 and #2 below. We have two more message handlers defined and used in a couple of jobs (#3 and #4 below).

(1)

IIS-DSEE-TFXR-00017 3 Warning converted to information
IIS-DSEE-TFKL-00030 2 3 APT_CombinedOperatorController(5),0: Lookup table is empty, no further warnings will be issued.


(2)

IIS-DSEE-TFXR-00017 3 Warning converted to information
IIS-DSEE-TFKL-00030 2 3 APT_CombinedOperatorController(1),0: Lookup table is empty, no further warnings will be issued.


(3)

IIS-DSEE-TFIG-00121 3 Warning converted to information

(4)

IIS-CONN-ORA-01003 3 End of communication Channel
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

So my thoughts were these:

If your log files get bigger and bigger, it may lead to slowness when updating them, since it's pretty much a full table scan to update those linked-list entries.

Your message handlers don't remove the issue, they just remove the logging of the issue in the log file. I have seen delays associated with the base issue that generated the warnings to begin with.

I'm not a big fan of message handlers to begin with and prefer clean jobs rather than sweeping stuff under the carpet. But... they have their uses I guess...


How big are your log files getting? Are you pushing into the GB range?

Keeping the last 10 runs sounds like a lot of space consumption for that many jobs.
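
A quick way to check on the engine (a sketch; the project path is an assumption and MyProject is a placeholder, so point it at one of your busy projects):

Code: Select all

cd /opt/IBM/InfoSphere/InformationServer/Server/Projects/MyProject
du -sk RT_LOG* | sort -n | tail -10   # ten largest job log hashed files, sizes in KB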

If you were to compile a job and run it, does it speed up and then slow down after a few days? You could pick one job to test that theory.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

10 runs isn't bad and should not be pushing the logs into the GBs; however, you may need to check whether these auto-purge settings are actually executing.

I would also check whether the shared memory parameters and ulimit values are at the recommended levels. Is IS/DS installed on VMs/LPARs or physical servers? Did you check whether there is any delay in resource assignment due to VM settings?
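
For example (generic AIX checks, a sketch only; run them on the engine tier as the job-owning user, dsadm in your case):

Code: Select all

su - dsadm -c "ulimit -a"   # effective per-user limits
ipcs -m | wc -l             # shared memory segments currently in use
ipcs -s | wc -l             # semaphore sets currently in use
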
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you seeing a large number of defunct osh or dsapi processes clogging up the process table?
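
A quick way to check (generic commands, a sketch only; adjust the filters to your environment):

Code: Select all

ps -ef | grep -c "<defunct>"                                # count of zombie processes
ps -ef | grep -E "osh|dsapi_slave" | grep -v grep | wc -l   # count of osh / dsapi slave processes
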
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Do you have the default values in the uvconfig file? Could you publish the values that you set for RLTABSZ/GLTABSZ/MAXRLOCK?
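
You can pull them straight from the engine, e.g. (path as shown later in this thread; adjust for your install):

Code: Select all

cd /opt/IBM/InfoSphere/InformationServer/Server/DSEngine
bin/smat -t | grep -E "RLTABSZ|GLTABSZ|MAXRLOCK"
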
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Thanks for your inputs guys.

This is what we have in the uvconfig file.

RLTABSZ 300
GLTABSZ 75
MAXRLOCK 299

Yes, we are seeing a large number of defunct processes owned by the dsadm user; the defunct processes are noticed at job run time.


The RT_LOG/RT_STATUS file sizes are normal; the largest file is 0.97 MB so far. We only started running the jobs last week on the newly upgraded 11.5.0.2 version, and the auto-purge settings are working as expected. DataStage is installed on LPARs.

Here are the ulimit settings for the dsadm user.

dsadm:

fsize = -1
cpu = -1
data = -1
stack = -1
rss = -1
nofiles = 102400
nofiles_hard = 102400
Arvind
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Looks like the value of GLTABSZ is a bit low; 300 is the minimum recommended for 11.x. The minimum combination suggested by Big Blue is 300/300/299, and these are minimum values; depending on your environment you might need to increase them.

Maybe you would like to publish the rest of the parameters in the uvconfig file so we can take a look.

Hope it helps!! Let us know how it goes.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

IBM customer support suggested the changes below in uvconfig.

Change MFILES from the default 200 to 500.
Change T30FILE from the default 512 to 1000.
Change GLTABSZ from the default 75 to 300.

We made the above three changes BUT the issue is still not resolved. We then reverted T30FILE from 1000 back to 512 and GLTABSZ from 300 back to 75.
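
For reference, each uvconfig change was applied with the usual edit-regen-restart sequence (a sketch; it assumes no jobs are running at the time):

Code: Select all

cd /opt/IBM/InfoSphere/InformationServer/Server/DSEngine
. ./dsenv
bin/uv -admin -stop     # stop the DS engine
vi uvconfig             # edit MFILES / T30FILE / GLTABSZ
bin/uvregen             # regenerate the binary config from uvconfig
bin/uv -admin -start    # restart the engine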

Here are the current uvconfig values.

/opt/IBM/InfoSphere/InformationServer/Server/DSEngine: bin/smat -t
Current tunable parameter settings:
* MFILES = 500
T30FILE = 512
OPENCHK = 1
WIDE0 = 0x3dc00000
UVSPOOL = /tmp
UVTEMP = /tmp
SCRMIN = 3
SCRMAX = 5
SCRSIZE = 512
QDEPTH = 16
HISTSTK = 99
QSRUNSZ = 2000
QSBRNCH = 4
QSDEPTH = 8
QSMXKEY = 32
TXMODE = 0
LOGBLSZ = 512
LOGBLNUM = 8
LOGSYCNT = 0
LOGSYINT = 0
TXMEM = 32
OPTMEM = 64
SELBUF = 4
ULIMIT = 128000
FSEMNUM = 23
GSEMNUM = 97
PSEMNUM = 64
FLTABSZ = 11
GLTABSZ = 75
RLTABSZ = 300
RLOWNER = 300
PAKTIME = 300
NETTIME = 5
QBREAK = 1
VDIVDEF = 1
UVSYNC = 0
BLKMAX = 8192
PICKNULL = 0
SYNCALOC = 0
MAXRLOCK = 299
ISOMODE = 1
PKRJUST = 0
PROCACMD = 0
PROCRCMD = 0
PROCPRMT = 0

ALLOWNFS = 1
CSHDISPATCH = /bin/csh
SHDISPATCH = /bin/sh
DOSDISPATCH = NOT_SUPPORTED
* NLSMODE = 1
NLSREADELSE = 1
NLSWRITEELSE = 1
NLSDEFSOCKMAP = NONE
NLSDEFFILEMAP = ISO8859-1
NLSDEFDIRMAP = ISO8859-1+MARKS
NLSNEWFILEMAP = NONE
NLSNEWDIRMAP = ISO8859-1
NLSDEFPTRMAP = ISO8859-1
NLSDEFTERMMAP = ISO8859-1
NLSDEFDEVMAP = ISO8859-1
NLSDEFGCIMAP = NONE
NLSDEFSRVMAP = MS1252-CS
NLSDEFSEQMAP = ISO8859-1
NLSOSMAP = ISO8859-1+MARKS
NLSLCMODE = 1
NLSDEFUSERLC = US-ENGLISH
NLSDEFSRVLC = US-ENGLISH
LAYERSEL = 0

OCVDATE = 0
MODFPTRS = 1
THDR512 = 0
UDRMODE = 0
* UDRBLKS = 0
MAXERRLOGENT = 100
JOINBUF = 4095
64BIT_FILES = 0
TSTIMEOUT = 60
PIOPENDEFAULT = 0
MAXKEYSIZE = 768
SMISDATA = 0
EXACTNUMERIC = 15
MALLOCTRACING = 0
CENTURYPIVOT = 1930
SPINTRIES = 0
SPINSLEEP = 0
DISKCACHE = -1
DCBLOCKSIZE = 16
DCMODULUS = 256
DCMAXPCT = 80
DCFLUSHPCT = 80
DCCATALOGPCT = 50

DCWRITEDAEMON = 0
DMEMOFF = 0x90000000
PMEMOFF = 0xa0000000
CMEMOFF = 0xb0000000
NMEMOFF = 0xc0000000
AUTHENTICATION = 0
* IMPERSONATION = 1
* INSTANCETAG = ade
HOSTFILELOCKING= 0
AUTHORIZATION = 0
Arvind
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

arvind_ds wrote: We then reverted T30FILE from 1000 back to 512 and GLTABSZ from 300 back to 75.
Why?
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

All values look good according to best practices, except GLTABSZ, which is lower than recommended. I would also advise pointing UVSPOOL/UVTEMP to another file system rather than /tmp, which is normally a temp file system shared by all applications running on your server. How big is your /tmp?
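
For example (a generic check; /dsengine_tmp below is just a placeholder for whatever dedicated file system you would carve out):

Code: Select all

df -g /tmp    # current size/usage of the shared /tmp

# then in uvconfig, point the engine spool/temp areas at the dedicated file system:
# UVSPOOL /dsengine_tmp
# UVTEMP  /dsengine_tmp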

In your position I would follow IBM support's advice to the letter; that way you will get the next tier of support from IBM. Eventually an engineer will get involved if regular support doesn't reach a resolution.

Is 60 always the number of concurrently running jobs when the slowness shows up? That's not a lot for the sizing of your environment IMHO, but it also depends on the profile of your jobs.

Would you try increasing the GLTABSZ value to 300 and get back to us?
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

So I think you need to take a step back and look at the environment and the nature of the jobs. Do a triage of cause and effect for your debugging.

Ensure you have the workload manager turned off on the project in question.
Do all projects suffer at the same time, or is it just one?
Are you running in a GRID/cluster or is it a standalone host?
Are your jobs mostly doing a similar activity, like an ODBC connection?
Does the slowness happen after X amount of days, or after X amount of jobs run in the environment?
Do you see ANY other resources on the host not getting freed up?
- memory, file handles, temp files, IPC sockets, etc. (a few sample checks follow this list)
Do you have multiple user ids that run jobs, or just one to run them all?
Do you have a functional environment to compare against?
- if this stuff tanks in prod... does DEV also suffer the same fate if given the same workload over time?
Is anyone else using the host for NON-DataStage activities?
Are there any orphaned PIDs out on your box?
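
For the resource checks above, a few generic commands (a sketch only; MyProject and the paths are placeholders, adjust to your environment):

Code: Select all

ipcs -a | grep dsadm | wc -l      # IPC segments/semaphores held by the DS user
ps -ef | grep dsadm | wc -l       # total processes owned by the DS user
ls "/opt/IBM/InfoSphere/InformationServer/Server/Projects/MyProject/&PH&" | wc -l   # phantom files piling up
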
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Since you are on LPARs, please run the commands below when you experience slowness (use -a if -A doesn't work).

Code: Select all

sar -u 5 5
sar -r 5 5
sar -A 1 1 
The first two commands will give you CPU and memory stats 5 times at 5-second intervals. In the third command's output I am most interested in the number of faults.

I would also recommend setting the inactivity timeout in Administrator to 86400 seconds, since you mentioned a lot of defunct processes, unless you have jobs running for more than 24 hours.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Ask your AIX admin and the networking guy how the LPAR internal LAN is set up between the LPAR image and the actual physical network card.

You might be on a 10 Gb network going out of the card, but between your host image and the card sits a 1 Gb internal I-LAN configuration. This is typically done as a matter of standards: since the one (or two) physical network cards have to service multiple LPARs, the system admins often configure the internal communication as a 1 Gb connection.
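
A couple of AIX commands you (or the admins) can use to check (a sketch; ent0 is an assumption, list your adapters first):

Code: Select all

lsdev -Cc adapter                        # find the ent devices on the LPAR
entstat -d ent0 | grep -i "media speed"  # negotiated speed of the adapter
lsattr -El ent0                          # adapter attributes (virtual vs physical, etc.)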