Server Jobs hanging while loading Hashed Files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


rmcclure
Participant
Posts: 48
Joined: Fri Dec 01, 2006 7:50 am

Server Jobs hanging while loading Hashed Files

Post by rmcclure »

A question for the old-timers:

I am working with DataStage 7.5.2 on a Windows server 2003 machine. We are having an issue with daily server jobs intermittently hanging while loading hashed files.

We don't believe the hanging is on the DB side, because we have many jobs that read from the DB, transform the data, then reload the DB, and they never hang; yet various jobs that read from the DB and load hashed files do hang, and always at the same point.
I have investigated while the job is hanging, and what I have found is that at the point of hanging the records have been read from the DB and loaded into the hashed file, but the modified date/time shown in the Windows folder is still the previous day's. I have played with the settings (Delete file before create vs. Clear file before writing) and the job hangs either way. Currently I am trying without "Allow stage write cache", but I am running out of guesses.

Based on the fact that the file has been loaded with current data but the timestamp has not changed, I suspect that DataStage and the operating system are not talking to each other.
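(If it matters, checking the file contents is easy enough from the Administrator client's Command window, logged into the project. The file name below is just a placeholder for one of our lookup hashed files, and I'm assuming an account-level hashed file with a VOC entry - a pathed file would need a SETFILE pointer first:

COUNT LKP_CUSTOMER
ANALYZE.FILE LKP_CUSTOMER

COUNT gives the row count to compare against what was read from the DB, and ANALYZE.FILE shows the current sizing of the dynamic file.)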


Notes of interest:
- This didn't just start happening. We can go a couple of weeks with no hanging or it might hang 3 days in a row then go a stretch with no hanging.
- Killing the job, resetting and re-run works fine every time.
- It is also never the same job over and over. It is as if the DataStage goblins are pulling names from a hat.
- The system people have looked at memory and CPU usage and both are well below the maximum.

Has anyone encountered this sort of thing before?
Thanks in advance
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Define "hanging" for us, please. Do you mean the process never finishes and you end up killing the job? Or it just takes much longer than it normally does? I'm going to assume the latter for the moment... well, at least until I post this and re-read your problem statement. Missed the "kill and restart just fine" part for some reason. :? How long have you waited before you've killed the jobs in question?

Never was a big fan of that "Allow stage write cache" option: while it can speed up loading during the shove-it-all-into-memory phase, it still needs to flush everything to disk before you can use the file, so if you have issues on the writing side, they'll still be there. I'm assuming these hashed files are on the large side, yes? Have you looked into pre-creating them with an appropriate minimum modulus so they don't have to constantly split and resize as the data is being written out? That, from what I recall, was the best way to get performance out of hashed files like that. There's also the option of moving from the default dynamic (Type 30) hashed file to one of the static variants, but let's save that for later.
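If you do want to try the pre-create route, it is a one-liner from the Administrator client's Command window - the file name and the modulus here are only examples (see the rough arithmetic further down), and double-check the CREATE.FILE syntax against the docs for your release:

CREATE.FILE LKP_CUSTOMER DYNAMIC MINIMUM.MODULUS 13000

Then have the job write to the existing file with "Clear file before writing" rather than delete/create, so the sizing survives from run to run - clearing a dynamic file shrinks it back to its minimum modulus but no further. If you want it automated, I believe the ExecTCL before-job subroutine will happily run that command for you.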

Let's see what official information we can find on the subject. Here are a couple of links - first, the file types explained. Note the warning on static hashed files. I've been on both sides of that: the side where they were a godsend and the side where they made things worse. Let's try sticking with the default dynamic hashed files for now.

Secondly, how to calculate the minimum modulus. Hope they help.
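Back-of-the-envelope version of that calculation, from memory, for a default dynamic file (2048-byte groups, 80 percent split load) - the numbers are invented for illustration and the linked doc should be treated as the authority:

200,000 rows x 100 bytes average (data + key + per-record overhead) = 20,000,000 bytes
20,000,000 / (2048 x 0.8) = roughly 12,200 groups
round that up a bit = MINIMUM.MODULUS 13000 or so

Sized that way, the file shouldn't have to keep splitting groups while the job is writing, which, as far as I recall, is where a lot of the load time goes.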
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Now, if these are not large hashed files and it is happening regardless of size, then I'm not really sure at the moment what might be up with that. Never had the 'pleasure' of working with DataStage in a Windows environment, so I'm not sure how that might be affecting the behavior.

Another question - when you kill and reset the aborted job, do you find a new entry in the log labelled "From previous run"?
-craig

"You can never have too many knives" -- Logan Nine Fingers
rmcclure
Participant
Posts: 48
Joined: Fri Dec 01, 2006 7:50 am

Post by rmcclure »

- Hanging means it never finishes. Where normal run time is 6 seconds, it has "hung" for as long as 4 hours.
- Minimum modulus is set to 1 (out of the box). We never changed it.
- Size doesn't matter. The current job had 150k records; strangely, we have a job with 38 million records and it never hangs.


From previous run
DataStage Job 5233 Phantom 6064
Job Aborted after Fatal Error logged.
Program "DSD.WriteLog": Line 250, Abort.

Could it be a problem with the log?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I... don't think so. I think that's just the result of you killing the job.

Was hoping Ray would stop by and offer up some words of wisdom. That or Ken Bland but I don't think he visits anymore. Both were Server Warriors and very helpful back when I was working with Server jobs (Version 3.0!) long before Parallel was even a twinkle in daddy's eye. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

After the job aborts, reset it (do not recompile it). Is there any additional information logged "from previous run"? This may signal whether there is a problem with the hashed file, for example.

Consider deleting and re-creating the hashed file. This can be done via Hashed File stage properties.

Consider using the Hashed File Calculator (provided with installation media as an unsupported utility) to guesstimate appropriate sizing parameters for the hashed file, particularly MINIMUM.MODULUS for a dynamic (Type 30) hashed file.
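If you prefer to perform the delete and re-create outside the job, the equivalent at the engine command level (Administrator client, Command window) is along these lines. The file name is illustrative only, the sizing should come from the Hashed File Calculator, and this assumes an account-level hashed file with a VOC entry:

DELETE.FILE LKP_CUSTOMER
CREATE.FILE LKP_CUSTOMER DYNAMIC MINIMUM.MODULUS 13000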
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rmcclure
Participant
Posts: 48
Joined: Fri Dec 01, 2006 7:50 am

Post by rmcclure »

ray.wurlod wrote:After the job aborts, reset it (do not recompile it). Is there any additional information logged "from previous run"? This may signal whether there is a problem with the hashed file, for example.

...
Thank you for the response. Unfortunately, my earlier posting of the "From previous run" entry was already after doing a reset.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Consider deleting and re-creating the hashed file. This can be done via Hashed File stage properties.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rmcclure
Participant
Posts: 48
Joined: Fri Dec 01, 2006 7:50 am

Post by rmcclure »

Hi Ray,

Thanks for the suggestion but I have tried that.
Original post: "I have played with settings: Delete File before create vs clear file before writing and they both hang."

There are two things stuck in my mind over this:
1) The fact that the records have been written to the hashed file, but in the Windows folder the timestamp on the file is from the previous day.
2) The SQL select process is still "active" on the source DB, and if I kill that process it aborts the DS job; but if I kill the DS job it does not abort the DB process.

I might be way off, but in 1) it seems that DS is not telling the OS it is done loading the file, and in 2) it seems DS is not communicating with the DB. In both cases it points to a communication issue.
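One more thing I can try the next time it hangs: assuming the usual UniVerse-style engine commands are available from the Administrator client's Command window (they appear to be on our install), these should show whether the job's phantom process is still there and whether something is sitting on a lock against the hashed file:

PORT.STATUS
LIST.READU EVERY

If those show a process stuck on a lock, or a phantom that is alive but idle, that would at least narrow down which side has stopped talking.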