CPU usage by job processes

djwagner
Premium Member
Posts: 17
Joined: Mon Jul 31, 2006 11:37 am

CPU usage by job processes

Post by djwagner »

Long-time reader, first-time poster... :)

I have inherited a project that uses DataStage 7.5.1.A Server Edition, running on a workstation-class machine with two 3.9 GHz dual-core Intel processors and 3 GB of RAM, with Windows XP as the OS. (We understand Windows XP is not a supported OS; unfortunately, we are stuck on a workstation-class machine and a workstation operating system for internal political reasons around servers in our department.)

We have been running successfully for 2-3 years but would like to improve performance, as our cycle times are longer than desired (and always have been). However, we cannot locate a bottleneck in CPU, memory, or disk performance.

I notice odd CPU behavior in many of our jobs when looking at either Perfmon or the Performance tab of Task Manager, and I am trying to investigate the cause using one particular job as a test case.

The job that I am testing with reads approximately 300,000 rows (150 columns) from a sequential file. It then has a Transformer stage that performs lookups against seven different hashed files (each with a small number of rows and columns) and calls small routines for simple calculations or mappings on many of the columns. Finally, it writes the results to a sequential file.

When this job is run, CPU utilization is only 25%. If I redesign the job with either the IPC stage or the Link Partitioner to create more than one process (the box has two dual-core CPUs, i.e. four logical processors), the sum of the job's individual processes STILL doesn't exceed 25% utilization at any given time. (i.e. Process A approaches 25% as Process B approaches 0%, and they fluctuate back and forth, but the sum of the two never goes above 25%.) In fact, the IPC and Link Partitioner designs cause the job to run longer!

If I divide my input file into two separate files of 150,000 rows each and run two copies of this job concurrently, then each job's processes use approximately 25% CPU or less, so the overall server is taxed at approximately 50% or less. In the end, my run time for this job goes from 10 minutes to 8 minutes by having TWO SEPARATE JOBS process concurrently. Thus, I am able to use more processing power with this approach.

Can someone tell me why this is occurring? Why am I unable to get a single job to run at more than 25% CPU, even with multiple processes? Is DataStage simply not *asking* for additional CPU, due to another bottleneck in the system? There is plenty of free physical memory available, and disk performance (disk queue length) looks completely fine during the run of this job. What else could be causing a bottleneck that running two separate jobs works around? Are there architectural limitations introduced by either a workstation-class machine or Windows XP?

Any assistance in helping me understand what I am observing (or what I am doing wrong) is greatly appreciated.

Thanks,
David
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

When you run the job, enable stage tracing on the Transformer stage and capture performance statistics. These are logged as a tab-delimited report in a log entry; copy it from there and paste it into Excel. The two time columns (Minimum and Average) are in microseconds. Add a column that is Count x Average to get the approximate total time spent in each activity.
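
If you prefer a script to Excel, a few lines like these do the same Count x Average arithmetic (a minimal sketch only: the file name perfstats.txt and the column headers Name, Count and Average are assumptions, so check them against your actual log entry):

    import csv

    # read the tab-delimited performance-statistics report saved from the log
    with open("perfstats.txt", newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))

    # Average is in microseconds, so Count x Average approximates total time
    totals = [(r["Name"], int(r["Count"]) * float(r["Average"])) for r in rows]

    # print activities in descending order of total time, converted to seconds
    for name, micros in sorted(totals, key=lambda t: -t[1]):
        print(f"{name:<40} {micros / 1e6:10.2f} s")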

I suspect you will see a lot of time spent performing lookups.

Could I suggest using no more than four lookups per Transformer stage (so two Transformer stages in your job), and enabling inter-process row buffering for that job?

Use the Monitor (in Director) to get CPU per stage process. Right-click in the background and choose "Show CPU". This figure is reported as a percentage of available CPU; for example, 34% means that 0.34 CPU seconds were used in one clock second. Beware that these figures are approximate, because the clock time is rounded to whole seconds. The "active stage finishing" event in the job log gives CPU and clock time in milliseconds for Transformer stages.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djwagner
Premium Member
Posts: 17
Joined: Mon Jul 31, 2006 11:37 am

Post by djwagner »

Hello,

Thanks for your response!

I ran the performance statistics on the job as-is, and it appears that most of the CPU time is being spent not on the hashed-file lookups but on the Transformer, where many routines, functions, and miscellaneous pieces of if-then logic transform practically every field of every record. These include mappings of code and description type fields, as well as some financial formulas for numeric fields. There are more field-level formulas in this Transformer than I originally thought!
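
A back-of-the-envelope calculation makes it plausible that the derivations dominate (the per-call cost below is purely an assumed figure, just to show the scale):

    rows = 300_000         # input rows
    derivations = 150      # roughly one routine/function call per column
    cost_us = 20           # ASSUMED average cost per call, in microseconds

    total_calls = rows * derivations           # 45,000,000 evaluations
    cpu_seconds = total_calls * cost_us / 1e6  # 900 CPU-seconds, i.e. 15 minutes
    print(total_calls, cpu_seconds)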

Regardless, I changed the job to have a maximum of four hashed-file lookups per Transformer, per your recommendation. I also added one additional Transformer (for a total of three) to handle as many of the field-level formulas as possible, where the formulas did not rely on any fields from the lookups, and I turned on inter-process row buffering for the job.

This redesigned job actually took LONGER (30 minutes to complete instead of 10), but the odd CPU behavior I previously described changed. The system unfortunately still did not use more than about 25% total CPU, but the individual processes no longer flip-flopped as before, where one process approached zero while another approached 25% (e.g. Process A at 4% while Process B is at 21%, and vice versa, the ENTIRE time). Instead, the processes showed approximately equal CPU utilization at all times (e.g. Process A at 8%, Process B at 9%, and Process C at 8%, fluctuating together).

This CPU behavior looks more normal compared to other applications, but why is the job not consuming more CPU (and therefore running faster)? Any ideas on the bottleneck? There is plenty of CPU, RAM, and disk available, but the job doesn't seem to use it.

Thanks,
David
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do you have read cache enabled on all the Hashed File stages?

DataStage will only consume the CPU it needs - this will be below 100% if it is waiting on events (such as I/O, COMMITs, etc.). As a test of this theory, change your final stage to a Sequential File stage to see whether that makes any difference. (Hint: use a copy of the job that you can throw away after you've finished testing.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djwagner
Premium Member
Posts: 17
Joined: Mon Jul 31, 2006 11:37 am

Post by djwagner »

Yes, the read cache is enabled on each of the hashed file stages.

The final target destination is a sequential file local to the DataStage box, so as to eliminate any network or database activity whatsoever. This is the configuration I have been testing in.

Do you have any thoughts as to whether the internal architecture (bus, RAM speed, ???) of this high-end, workstation-class machine is causing a bottleneck? Or perhaps the Windows XP operating system?

I would like to test performance on a server-class machine running Windows 2000, which is something that we are trying to locate, but will take a while in our organization. :(
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Your original job is working perfectly. You have four CPU cores, so a single-threaded job will use one core at most, which is exactly the 25% you are seeing. Adding more Transformers in a chain will not necessarily improve parallelism.

Splitting the source data into two sets (what is called partitioning) and running multiple instances (divide and conquer) is known as "partitioned parallelism", and it is the ideal approach for processing the same amount of data in less time.

You have four cores and a job that runs one core at 100% CPU. This is the best-case tuning scenario. Use multiple job instances and a partitioning calculation to divide the source data prior to the intense transformation. I suggest five job instances spread over your four CPUs to keep the cores working at full speed.
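
For what it's worth, the partitioning calculation can be as simple as a round-robin split of the source file before the instances start. A rough sketch (the file names are placeholders, and it assumes the file has no header row):

    # round-robin split of a sequential file into N chunks, one per job instance
    N = 5
    outs = [open(f"input_part{i}.txt", "w") for i in range(N)]
    with open("input.txt") as src:
        for rownum, line in enumerate(src):
            outs[rownum % N].write(line)   # row goes to partition rownum mod N
    for out in outs:
        out.close()

Each job instance then reads only its own input_part file.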

You can see my Performance Analysis whitepaper in my website's members forum; it clearly articulates your situation and explains how to use partitioned-parallelism techniques to improve overall transformation throughput.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

By the way, even running more processes to tax your CPUs may take more time, because until now your disk subsystem hasn't had to keep up. A single executing instance doesn't sufficiently expose the bottlenecks of the disks. Getting more instances going may actually result in lower CPU utilization, as the jobs degrade once memory/disk activity comes into play.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi,

This post reminds me of my initial days of working with DataStage 7.5.1A (Server Edition) on Windows.

The same problem in my case was resolved by following almost ALL of the suggestions highlighted in this post.

The only thing I would like to add is: please check where the routines are invoked. If a routine is invoked from a stage variable whose value is then used in an output field derivation, the routine executes for each row being processed, whereas if the routine is invoked directly in the output derivation of the fields, it executes ONLY for the valid output rows.
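
A rough illustration in script form (purely illustrative; expensive_routine, the input values, and the constraint are all made up):

    # counts how often the routine runs under each invocation style
    calls = {"stage_var": 0, "derivation": 0}

    def expensive_routine(value, counter):
        calls[counter] += 1        # stands in for a costly DS routine call
        return value * 2

    rows = [1, -2, 3, -4, 5]       # made-up input; the constraint keeps positives

    # stage-variable style: the routine is evaluated for EVERY input row
    out1 = []
    for r in rows:
        sv = expensive_routine(r, "stage_var")
        if r > 0:
            out1.append(sv)

    # output-derivation style: evaluated only for rows passing the constraint
    out2 = [expensive_routine(r, "derivation") for r in rows if r > 0]

    print(calls)                   # {'stage_var': 5, 'derivation': 3}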

Gurus, please guide me if I am wrong :)
Share to Learn, and Learn to Share.