Configuration file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india


Post by sachin1 »

Hi Team,

I have 8 CPUs on an SMP server. I am trying to understand whether configuring 2 logical processing nodes with 4 CPUs each would be best, or whether I should configure 4 logical processing nodes with 2 CPUs each.

My understanding is that if the number of processing nodes is increased, the number of processes will also increase. Would that degrade performance?

Your assistance would help me gain knowledge.

regards,
Sachin.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You don't have any control over the physical CPUs on a system via the configuration file. Since you have 8 CPUs, everything you do will have access to all eight, with the operating system controlling what runs where. Your control in the configuration file is over the number of logical 'nodes' - how many processes to spawn - dedicated to running the job, with (as noted) the O/S deciding which runs where. Unless you are talking about building logical virtual partitions on the box?
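To make "logical nodes" concrete: a two-node configuration file for a single SMP host might look like the sketch below. This is a minimal example only; the host name and resource paths are placeholders, not values from this thread, and real configurations typically vary the scratch/resource disks per node.

```
{
	node "node1"
	{
		fastname "smp_host"
		pools ""
		resource disk "/data/datasets" {pools ""}
		resource scratchdisk "/data/scratch" {pools ""}
	}
	node "node2"
	{
		fastname "smp_host"
		pools ""
		resource disk "/data/datasets" {pools ""}
		resource scratchdisk "/data/scratch" {pools ""}
	}
}
```

A four-node file would simply repeat the node block four times against the same fastname; nothing in the file maps a node to specific physical CPUs.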
-craig

"You can never have too many knives" -- Logan Nine Fingers
sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india

Post by sachin1 »

Hi,

So how do we identify the Conductor, Section Leader and Player processes that are created?

The question is: for performance tuning, should the number of processes created by the OS be smaller or larger?

thanks,
Sachin.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's a pretty open question without a single answer... except for "depends". I'm sure others will come along and help, as I've been away from the tool for quite some time, but generally speaking more is better - within limits - as there are far too many variables to hand out a single, simple answer. Your options on a single server are more limited, so that does ease the burden... somewhat.

As one example, what else will be running at the same time as this job? And I don't mean just how many other DataStage jobs, although that will be your easiest question to answer, I would imagine.

Unfortunately, performance tuning can be a bit of a dark art and (IMHO) you'd have to start with a well tuned job before you started looking at what the differences in the config file would do. It can be a painful, iterative process - increase resources, measure, increase, measure until performance 'peaks' and then starts to be adversely impacted, then back it back down to the sweet spot.

In my experience, most of us don't take it that far and get some version of 'good enough' going without worrying about squeezing that last ounce of performance from something. Well, unless you need to, of course. Back in the day, we were running micro-batches for a specific process with a very specific run-time limit, as in it had to complete in less than X minutes day over day. That was fun. Nailed it, by the way. :wink:

The specifics on "how are we identifying Conductor and Section leaders" I'll have to leave to others. One thing, though: in case you are not aware, there are a number of free IBM Redbooks available that might prove illuminating. Some I am aware of:

InfoSphere DataStage Parallel Framework Standard Practices
IBM InfoSphere DataStage Data Flow and Job Design
Deploying a Grid Solution with the IBM InfoSphere Information Server


I'm sure there are others. A bit old, and I seem to recall one was not much liked by Ray, but they are free and (generally) a good resource.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india

Post by sachin1 »

Thanks Craig,

I will go through these documents and check out the configuration for my specific requirement.

regards,
Sachin.
cdp
Premium Member
Posts: 113
Joined: Tue Dec 15, 2009 9:28 pm
Location: New Zealand

Post by cdp »

Personally, I was only able to understand Conductors, Section Leaders and Players after reading the last chapter of Ray Wurlod's white paper:

http://dsxchange.net/uploads/White_Pape ... aStage.pdf
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What a delightfully succinct explanation! :wink:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india

Post by sachin1 »

Reading through one of the documents, I found the following.

For any job executed with all stages running in parallel, the process list would be:

(1 Conductor process) +
(number of processing nodes = Section Leaders) +
(number of parallel running stages * Section Leaders)

So, for example, for a 3-stage job executing on 2 processing nodes (4 CPUs each) we would have:

1 + 2 + (3 * 2) = 9 processes created.

Now back to the original question: on an 8-CPU system with 4 processing nodes and 3 parallel executing stages, the above formula gives 1 + 4 + (3 * 4) = 17 processes.
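The arithmetic above can be sketched in a few lines; this is just a helper to restate the formula from the document, not anything from the product itself.

```python
def px_process_count(nodes: int, parallel_stages: int) -> int:
    """Estimate OS processes for a parallel job:
    1 Conductor + one Section Leader per logical node
    + one Player per parallel stage on each node."""
    return 1 + nodes + parallel_stages * nodes

# 3-stage job on 2 logical nodes:
print(px_process_count(2, 3))  # 9
# Same job on 4 logical nodes:
print(px_process_count(4, 3))  # 17
```

Note the estimate grows linearly with the node count, which is why doubling nodes nearly doubles the process total.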

So the question is: should we have 9 processes or 17? Which one is better?

regards,
Sachin.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Depends. Or again... more.

Look, no one can tell you unequivocally what would be better for your job running on your hardware with your data with everything else your server has running except... well, you.

Try both. More than once. Compare and contrast, then decide. Let us know. Or just go with the 'more is better' approach.
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Sachin, I think you are over analyzing the situation.

You should care about process count if you suspect you are approaching the limit imposed by your NPROCS setting or ULIMITs for that particular user id.

You need to factor in the other concurrent jobs running with that user id.
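If you do want to sanity-check your headroom against the per-user process limit PaulVL mentions, the Python standard library can read it on Unix systems. A minimal sketch (Unix-only; the interpretation of soft vs hard limits is standard `resource` module behavior):

```python
import resource

# Per-user process limit (the same limit `ulimit -u` reports).
# Unix-only; RLIMIT_NPROC is not available on Windows.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"soft process limit: {soft}, hard process limit: {hard}")
```

Compare that soft limit against the total estimated process count across all concurrently running jobs for the user id, not just one job.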

Job design will affect job performance more than pid count does.

MORE is not always better.

LESS is not always better.


Performance tuning done in a vacuum - when your job is the only thing running on the box - will give you the best results. But that doesn't happen in real life, so you need to understand what else is running on the box. Which ties back into the pid count and the question: "Does it really matter? What am I measuring and why?"