DSXchange: DataStage and IBM Websphere Data Integration Forum
rmcclure
Participant



Joined: 01 Dec 2006
Posts: 48

Points: 586

Posted: Fri Feb 23, 2018 8:12 am

DataStage® Release: 7x
Job Type: Server
OS: Windows
A question for the old-timers:

I am working with DataStage 7.5.2 on a Windows server 2003 machine. We are having an issue with daily server jobs intermittently hanging while loading hashed files.

We don't believe the hanging is on the DB side: we have many jobs that read from the DB, transform the data, and reload the DB, and they never hang, but various jobs that read from the DB and load hashed files do hang, and always at the same point.
I have investigated while a job is hanging, and what I have discovered is that at the point of hanging the records have been read from the DB and loaded into the hashed file, but the modified date/time shown in the Windows folder is the previous day's date. I have played with the settings "Delete file before create" and "Clear file before writing", and the job hangs either way. Currently I am trying without "Allow stage write cache", but I am running out of guesses.

Based on the fact that the file has been loaded with current data but the timestamp has not changed, I suspect that DataStage and the operating system are not talking to each other.
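
For reference, the timestamp check can be scripted; here is a rough sketch (not from my original investigation) that assumes the hashed file is pathed and dynamic (Type 30), so on disk it is a directory containing DATA.30 and OVER.30. The path is made up for illustration.

Code:
# Rough sketch: compare the OS modification times of a dynamic hashed file's
# component files against the expected load time. Assumes a pathed Type 30
# hashed file (a directory containing DATA.30 and OVER.30); the path below
# is hypothetical.
import os
import datetime

hashed_file_dir = r"D:\Projects\MyProject\HashedFiles\MyLookup"  # hypothetical path

for name in ("DATA.30", "OVER.30"):
    path = os.path.join(hashed_file_dir, name)
    if os.path.exists(path):
        mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        print(f"{name}: last modified {mtime:%Y-%m-%d %H:%M:%S}")
    else:
        print(f"{name}: not found - file may not be pathed/dynamic, or the path is wrong")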


Notes of interest:
- This didn't just start happening. We can go a couple of weeks with no hanging, or it might hang three days in a row and then go a stretch with no hanging.
- Killing the job, resetting it, and re-running works fine every time.
- It is also never the same job over and over. It is as if the DataStage goblins are pulling names from a hat.
- The system people have looked at memory and CPU usage and both are well below the maximum.

Has anyone encountered this sort of thing before?
Thanks in advance
chulett

Premium Poster since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42622
Location: Denver, CO
Points: 219444

Posted: Fri Feb 23, 2018 8:38 am

Define "hanging" for us, please. Do you mean the process never finishes and you end up killing the job? Or it just takes much longer than it normally does? I'm going to assume the latter for the moment... well, at least until I post this and re-read your problem statement. Missed the "kill and restart just fine" part for some reason. Confused How long have you waited before you've killed the jobs in question?

Never was a big fan of that "Allow stage write cache" option: while it can speed up the loading of the file during the shove-it-into-memory phase, it still needs to flush it all to disk before you can use it, so if you have issues on the writing side, they'll still be there. I'm assuming these hashed files are on the large side, yes? Have you looked into pre-creating them with an appropriate minimum modulus so they don't have to constantly resize as the data is being written out? That, from what I recall, was the best way to get performance out of hashed files like that. There's also the option of moving from the default dynamic Type 30 hashed file to one of the static variants, but let's save that for later.

Let's see what official information we can find on the subject. Here are a couple of links - first, the types explained. Note the warning on static hashed files. I've been on both sides of that: the side where they were a godsend and the side where they made things worse. Let's try sticking with the default dynamic hashed files for now.

Secondly, how to calculate the minimum modulus. Hope they help.
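
To make the arithmetic concrete, here is a back-of-the-envelope sketch (my own numbers and assumptions, not figures from the linked material). It estimates a minimum modulus for a dynamic (Type 30) hashed file assuming 2048-byte groups (group size 1), the default 80% split load, and a rough 20-byte per-record overhead.

Code:
# Back-of-the-envelope estimate of a minimum modulus for a dynamic hashed file.
# All inputs are assumptions - plug in your own record counts and sizes.
record_count   = 150000   # expected rows in the hashed file (assumed)
avg_record_len = 120      # average bytes per record, key included (assumed)
overhead       = 20       # rough per-record storage overhead (assumed)
group_bytes    = 2048     # group size 1; use 4096 for group size 2
split_load     = 0.80     # default split load percentage

data_bytes  = record_count * (avg_record_len + overhead)
min_modulus = int(data_bytes / (group_bytes * split_load)) + 1

# This value would go into the Hashed File stage's file-creation options
# ("Minimum modulus") so the file is pre-sized instead of splitting groups
# while the job is loading it.
print("Suggested minimum modulus:", min_modulus)

With those assumed numbers the sketch suggests a modulus somewhere around 12,800, rather than the out-of-the-box value of 1.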

_________________
-craig

And I'm hovering like a fly, waiting for the windshield on the freeway...
chulett

Premium Poster since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42622
Location: Denver, CO
Points: 219444

Posted: Fri Feb 23, 2018 8:44 am

Now, if these are not large hashed files and it is happening regardless of size then I'm not really sure at the moment what might be up with that. Never had the 'pleasure' of working with DataStage in a Windows environment, so not sure how that might be affecting the behavior.

Another question - when you kill and reset the aborted job, do you find a new entry in the log labelled "From previous run"?

_________________
-craig

And I'm hovering like a fly, waiting for the windshield on the freeway...
rmcclure
Participant



Joined: 01 Dec 2006
Posts: 48

Points: 586

Posted: Fri Feb 23, 2018 12:17 pm

- Hanging means it never finishes. Where normal run time is 6 seconds, it has "hung" for as long as 4 hours.
- Minimum modulus is set to 1 (out of the box). We never changed it.
- Size doesn't matter. The current job had 150k records. Strangely, we have a job with 38 million records and it never hangs.


The "From previous run" entry shows:
DataStage Job 5233 Phantom 6064
Job Aborted after Fatal Error logged.
Program "DSD.WriteLog": Line 250, Abort.

Could it be a problem with the log?
chulett

Premium Poster since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42622
Location: Denver, CO
Points: 219444

Posted: Sat Feb 24, 2018 7:58 pm

I... don't think so. I think that's just the result of you killing the job.

I was hoping Ray would stop by and offer up some words of wisdom. That, or Ken Bland, but I don't think he visits anymore. Both were Server Warriors and very helpful back when I was working with Server jobs (version 3.0!), long before Parallel was even a twinkle in daddy's eye.

_________________
-craig

And I'm hovering like a fly, waiting for the windshield on the freeway...
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54254
Location: Sydney, Australia
Points: 294257

Posted: Tue Feb 27, 2018 4:46 pm

After the job aborts, reset it (do not recompile it). Is there any additional information logged "from previous run"? This may signal whether there is a problem with the hashed file, for example. ...
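
If the Director log is awkward to dig through, the same entries can also be pulled with the dsjob command-line client. A small sketch follows; the project and job names are placeholders, and the exact flags should be verified against the dsjob usage output on your release.

Code:
# Sketch: list recent log entries for a job via the dsjob client after a reset,
# to look for any extra "From previous run" detail. Project and job names are
# placeholders; verify the flags against the dsjob usage output on your install.
import subprocess

cmd = ["dsjob", "-logsum", "-max", "50", "MyProject", "MyHangingJob"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)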

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
rmcclure
Participant



Joined: 01 Dec 2006
Posts: 48

Points: 586

Posted: Wed Feb 28, 2018 11:14 am

ray.wurlod wrote:
After the job aborts, reset it (do not recompile it). Is there any additional information logged "from previous run"? This may signal whether there is a problem with the hashed file, for example.

...


Thank you for the response. Unfortunately, the "From previous run" entries I posted earlier were already captured after doing a reset.
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54254
Location: Sydney, Australia
Points: 294257

Posted: Thu Mar 01, 2018 8:54 pm

Consider deleting and re-creating the hashed file. This can be done via Hashed File stage properties.
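
If the stage option does not clear whatever is wrong, a blunter variant is to remove the file at the OS level and let the job re-create it. A rough sketch is below, assuming a pathed hashed file (a directory on disk) and a Hashed File stage with "Create file" enabled; the path is made up, and this is not appropriate for account-based hashed files, which also have VOC entries to deal with.

Code:
# Sketch: delete a pathed hashed file so the next job run re-creates it fresh.
# Assumes the Hashed File stage has "Create file" enabled; the path is
# hypothetical. Do NOT use this on account-based hashed files.
import os
import shutil

hashed_file_dir = r"D:\Projects\MyProject\HashedFiles\MyLookup"  # hypothetical

if os.path.isdir(hashed_file_dir):
    shutil.rmtree(hashed_file_dir)
    print("Removed", hashed_file_dir, "- the next run should re-create it")
else:
    print("Nothing to remove at", hashed_file_dir)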

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
rmcclure
Participant



Joined: 01 Dec 2006
Posts: 48

Points: 586

Posted: Fri Mar 02, 2018 7:55 am

Hi Ray,

Thanks for the suggestion, but I have tried that.
Original post: "I have played with the settings 'Delete file before create' and 'Clear file before writing', and the job hangs either way."

There are two things stuck in my mind over this:
1) The records have been written to the hashed file, but in the Windows folder the timestamp on the file is from the previous day.
2) The SQL select process is still "active" on the source DB, and if I kill that process it aborts the DS job; but if I kill the DS job it does not abort the DB process.

I might be way off, but in 1) it seems that DS is not telling the OS it is done loading the file, and in 2) it seems DS is not communicating with the DB. In both cases it points to a communication issue.