
Slow DataStage performance

Posted: Wed Nov 29, 2017 10:11 am
by jackson.eyton
Hi everyone,
So I have been wondering how to present this question to IBM and figured I would ask here first. A quick summary: DataStage Designer and all of the other applications that relate to DataStage (Administrator, Director, etc.) will periodically freeze up and become incredibly slow. There are two scenarios in which this happens.

1. If my fellow ETL developer and I are both in DataStage Designer, working on our separate jobs, and one of us runs a job, DataStage will then crunch for both of us until that job has completed. Meaning if he runs his job, I cannot open any stages in my job without DataStage going "Not Responding" until his job has completed. This seems very odd to me.

2. Various scenarios can occur where DataStage Designer simply starts hanging for one of us for no obvious reason. Notably, yesterday I highlighted about 8 jobs that were backup copies and attempted to delete them. My DataStage stopped responding, and as soon as that happened my coworker's DataStage started running incredibly slowly. I waited an hour and ended up killing the process and disconnecting all of my sessions via the admin console. The slowness, however, did not resolve and persists even today. We have had this happen before with no obvious rhyme or reason, and we've had to reboot the server to correct it.

We have a development environment on one server, which we'll call Dev1, and our web services for InfoSphere on another server, which we'll call Svc1. Both servers run fully patched Windows Server 2012. We have monitored the system performance and resources of both servers during both of the above scenarios, and there are no persistent spikes in any resource (CPU, memory, disk, or even network). We are using fat clients, not Terminal Services, so the slowdown of DataStage Designer when another individual merely runs a job confuses me.

Does anyone have any suggestions on where to start with this?

*EDIT*
We have multiple projects on this development server, and I have confirmed that the performance issue persists across the projects, pointing to the server as the common denominator.

*UPDATE*
We were able to get some improvement by closing the log pane in Designer. We also cleared some of the logs, whose retention is set to the last 3 runs of a job, and that helped some as well. Opening a stage in a job and looking at its properties, for example, now seems normal. However, saving a new job and compiling still take far longer than on a freshly booted server. A job as simple as SRC---TRXFM---DEST would normally take 30 seconds or so to compile; it now takes nearly 10 minutes (6 minutes as of the last timing).
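
In case it helps anyone else digging through the same thing, here is a rough sketch (the project name is a placeholder, and it assumes the dsjob command-line client is on the PATH on the engine tier) of how one could count log entries per job to spot the bloated ones:

    # count_log_entries.py -- rough sketch only: list the jobs in a project and count
    # their log entries with the dsjob command-line client.
    # "DEV_PROJECT" is a placeholder project name; dsjob is assumed to be on the PATH.
    import subprocess

    PROJECT = "DEV_PROJECT"  # placeholder

    def run_dsjob(args):
        """Run dsjob with the given arguments and return its stdout split into lines."""
        result = subprocess.run(["dsjob"] + args, capture_output=True, text=True)
        return result.stdout.splitlines()

    # -ljobs lists the jobs in the project; keep the non-empty lines.
    jobs = [line.strip() for line in run_dsjob(["-ljobs", PROJECT]) if line.strip()]

    for job in jobs:
        # -logsum prints a summary line per log entry; the line count is a rough
        # indicator of how much log history the job is carrying.
        entries = run_dsjob(["-logsum", PROJECT, job])
        print(f"{job}: {len(entries)} log lines")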

Posted: Wed Nov 29, 2017 10:38 am
by boxtoby
Hi
I have had scenario 1 when working from home with a connection that is not as good as in the office. I found that turning off "Show performance statistics" in Designer helped a lot.

Hope that helps,
Bob

Posted: Wed Nov 29, 2017 10:44 am
by jackson.eyton
I may give that a shot and see if it makes any improvement; obviously we won't want to leave it off permanently. What strikes me as really odd is that it happens at all. I can understand MY Designer going slow when I run a job MYSELF. However, when my coworker runs a job, that should not slow down my Designer for things as simple as opening a transform stage and getting to the stage properties. It's almost as if Designer runs as an instance on the server itself and isn't REALLY a fat client.

Posted: Wed Nov 29, 2017 5:02 pm
by chulett
jackson.eyton wrote: A job as simple as SRC---TRXFM---DEST would normally take 30 seconds or so to compile; it now takes nearly 10 minutes (6 minutes as of the last timing).
Is the same thing true if a transformer is not involved? Don't forget they compile down to C++ code, and we've seen sites with a single concurrent-user compiler license...

Posted: Wed Nov 29, 2017 5:31 pm
by jackson.eyton
THAT is an interesting question... At this stage of our warehouse development, most of our jobs contain at least one transform stage. Could you point me in the right direction on where to check our compiler license status?

Posted: Thu Nov 30, 2017 8:24 am
by qt_ky
We don't generally run into that problem but we don't run our servers on Windows either.

I would be mostly suspicious of any kind of security software, especially anti-virus software running scans on your clients, or even worse, on your server.

My next suspicion would be gremlins on your network. Years ago we did have some extreme network problems that would crop up and bring everything to a grinding halt.

Posted: Thu Nov 30, 2017 9:56 am
by jackson.eyton
Yeah, we have done the standard AV disabling, as well as network sniffing and monitoring, and so far that all looks fine. I was able to verify that the CPU is pegged when I run a transformation job.

This does not occur when compiling a job. I did a little digging into the compiler license, but I am unsure exactly how to confirm how many concurrent compiler licenses we have available. Additionally, there are only two of us, so we rarely compile anything at the same time, which, per the quote below, seems to be the only case where multiple licenses really matter:
"For some compilers, each developer must have a license at the time that the developer compiles the job with the Designer client. The maximum number of simultaneous processes that compile jobs determines the number of licenses."
Source: https://www.ibm.com/support/knowledgece ... s_cpp.html
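
If anyone else needs to see which compiler and options a project is configured to use, I believe the dsadmin command-line client can list a project's environment variables; something like this rough sketch (placeholder project name, dsadmin assumed to be on the PATH, and the -listenv option is my assumption from the docs) would pull out the compiler-related ones:

    # check_compiler_settings.py -- rough sketch only: dump a project's compiler-related
    # environment variables (APT_COMPILER, APT_COMPILEOPT, APT_LINKER, ...) via dsadmin.
    # "DEV_PROJECT" is a placeholder; dsadmin and its -listenv option are assumed here.
    import subprocess

    PROJECT = "DEV_PROJECT"  # placeholder

    result = subprocess.run(["dsadmin", "-listenv", PROJECT], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        if "APT_COMPILE" in line or "APT_LINK" in line:
            print(line)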

As of now the server is still acting up, as we are refusing to reboot it just to temporarily relieve the pain. As such, things like simply running a job take 2+ minutes between the job log saying "Starting job" and data actually starting to process. :-(

Posted: Thu Nov 30, 2017 11:08 am
by chulett
Only two of you? Not so much an issue then. :wink:

Posted: Thu Nov 30, 2017 11:38 am
by jackson.eyton
Correct, as far as the compiler licensing goes. However, the issues we see when we run jobs, combined with the fact that there are only two of us, really raise a red flag for me. I imagine there are companies with dozens of developers all building and testing jobs. It's rather odd that my running a job causes my partner's Designer to slow to a crawl, and vice versa...

Does anyone have the actual recommended InfoEng server specifications? What I am finding in the IBM Knowledge Center doesn't make a whole lot of sense.

Posted: Thu Nov 30, 2017 1:05 pm
by PaulVL
It's not odd if someone went and changed your APT config file and went nuts with the number of nodes.

Posted: Thu Nov 30, 2017 1:25 pm
by jackson.eyton
I have done some playing around with those myself in the past; however, our APT settings are configured as a two-node system. Our Dev server only has two cores, though. I am of the mind that the number of server cores should be the result of the equation (N * E) + 1, where N is the number of nodes in the APT config and E is the number of employees who could be running jobs simultaneously. By that logic we would need (2 * 2) + 1 = 5 cores, versus the 2 we have.
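
For anyone following along, a two-node APT configuration file follows the standard layout, roughly like this (the fastname and paths below are placeholders, not our actual values):

    {
        node "node1"
        {
            fastname "dev1"
            pools ""
            resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
        node "node2"
        {
            fastname "dev1"
            pools ""
            resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }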

Additionally, this is not the whole of our issue, as the performance hit can occur and persist even when jobs are not running: simply opening jobs or stages in a job, opening wizards, and even creating build packages are all extremely sluggish.

Posted: Thu Nov 30, 2017 3:45 pm
by qt_ky
Perhaps you are on a virtual server and other virtual machines on the same hardware are hammering away. It could be thin-provisioned and overloaded.

Posted: Thu Nov 30, 2017 6:26 pm
by chulett
On the subject of cores versus nodes, don't forget that "cores" are a physical concept while "nodes" are a logical concept... basically worker processes in the O/S, where you typically have no control over which ones run where. There's WAY more to it than that, of course, and I'm sure Ray has given a masters class or twelve on the subject, but I wanted to put it out there that there is no simple equation for the number of nodes any given system could support.

<tip-toes silently away from the keyboard>

Posted: Fri Dec 01, 2017 8:44 am
by jackson.eyton
qt_ky,
Yes, our server is a VM on a cluster, and our IT department is resistant to adding more cores to our VM, as that could potentially decrease performance due to the server having to wait for a higher number of cores to become available.

chulett,
..... O_o ....well hmm....

Posted: Fri Dec 01, 2017 9:36 am
by PaulVL
Here's my two cents:

1) I dislike VMs for any ETL work. Too much politics with the back-room server folks. You've paid a lot of money to license this software, and it's used to push the data that your company feeds off of... get some hardware under it, not a virtual layer.

2) Ditch Windows and go to Unix for this ETL tool.

3) My phone has more cores than your company ETL tool host.

4) Do you have xmeta running on the same host as your engine? You could farm that off onto a separate host, freeing up CPU/memory. I would keep the domain tier (WAS) with the engine.


Once you said you had two cores running on Windows... there was little surprise that you are getting laggy performance when someone actually runs a job and person number two tries to do anything else.


Given that you have 2 cores... I am guessing that money is an issue for your department. I would recommend getting a second host (since hosts are cheaper than software licensing for DataStage) and farming off all non-essential "data manipulation" activities to it, like a zip, an SFTP, etc. Keep the data on a shared mount; that zip or FTP, placed on a different host, will free up CPU and memory on your primary ETL host, so you can run more concurrent jobs. You should write a standard set of scripts to consistently farm that work off to the other box.
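
As a rough illustration only (the host name and paths below are made up, and it assumes passwordless ssh to the helper box plus a mount shared by both hosts), one of those wrapper scripts could be as simple as:

    # offload_zip.py -- rough sketch only: run the compression on a helper host so the
    # DataStage engine keeps its CPU and memory for jobs. "etl-helper01" and the paths
    # are placeholders; the data lives on a mount that both hosts can see.
    import subprocess
    import sys

    HELPER_HOST = "etl-helper01"            # placeholder helper box
    SHARED_DIR = "/mnt/etl_share/outbound"  # placeholder shared mount

    def zip_on_helper(file_name: str) -> int:
        """Compress one file on the helper host and return the remote exit code."""
        remote_cmd = f"gzip {SHARED_DIR}/{file_name}"
        # ssh runs the gzip on the helper host; because the directory is a shared
        # mount, the compressed file is visible to the ETL host as well.
        result = subprocess.run(["ssh", HELPER_HOST, remote_cmd])
        return result.returncode

    if __name__ == "__main__":
        sys.exit(zip_on_helper(sys.argv[1]))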