DSGetStageInfo giving inaccurate results

Posted: Thu Aug 16, 2018 8:52 am
by thompsonp
Jobs that work as expected on v8.1 have been migrated to v11.5.
As part of an after-job subroutine, the link counts are examined using:

DSGetLinkInfo(JobHandle,ThisStage,ThisLink,DSJ.LINKROWCOUNT)
Based on the naming convention of the links, a simple check is performed, which is essentially: sum of inputs = sum of outputs.
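
To make the check concrete, here is a minimal sketch of that reconciliation logic. It is written in Python rather than DataStage BASIC purely for illustration, and the `in_` / `out_` link prefixes are hypothetical stand-ins for the real naming convention, which isn't described here. The counts passed in would be whatever DSGetLinkInfo returned for each link.

```python
def reconcile(link_counts, input_prefix="in_", output_prefix="out_"):
    """Compare total rows on input links with total rows on output links.

    link_counts maps link name -> row count. The prefixes are assumed
    placeholders for the actual link naming convention.
    Returns (inputs_total, outputs_total, balanced).
    """
    inputs_total = sum(
        count for name, count in link_counts.items()
        if name.startswith(input_prefix)
    )
    outputs_total = sum(
        count for name, count in link_counts.items()
        if name.startswith(output_prefix)
    )
    return inputs_total, outputs_total, inputs_total == outputs_total
```

With the figures from the example further down (9500 rows in, but the output link reporting only 5000), `reconcile({"in_src": 9500, "out_ds": 5000})` reports the mismatch that makes the subroutine abort the job.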

In v8.1 this works, but in v11.5 some jobs report incorrect numbers on links.
As an example, a link writing to a dataset might report 5000 records in the after-job subroutine, but if I open the dataset in DataSet Management the number of records will be as expected, say 9500.
As a consequence, the reconciliation fails and the subroutine aborts the job.
If I manually run the same code in the subroutine after the job has completed, it also gives the wrong link counts.

I have raised a case with IBM and their feedback so far is that JobMonApp can suffer a lag on heavily loaded systems. This system, however, is new, is only being used by me (for now), and is not at all heavily utilised when these failures occur.

I'd appreciate any suggestions about what could be going wrong whilst I await further feedback and suggestions from IBM.

Also if anyone knows the mechanism by which the link counts are captured and stored please can you explain the process; that might point to somewhere else I can look to try and identify the underlying issue.

All I have been able to see is that JobMonApp.log is being written to with details of the counts very frequently. Does another process examine this file and store the results elsewhere when the job completes?

In an effort to work around the problem during initial testing on v11.5 I added a sleep 10 to the subroutine and it hasn't happened since, but that's not a long-term solution.
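
One alternative to a fixed sleep is to poll the count until two consecutive reads agree, with an upper bound on the total wait. The sketch below is Python for illustration only, not DataStage BASIC; `read_count` is a hypothetical callable standing in for the DSGetLinkInfo call, and the interval and timeout values are arbitrary assumptions. Like the sleep, this only masks a lag rather than explaining it, and a stale value that is read twice in a row would still be accepted as stable.

```python
import time


def wait_for_stable_count(read_count, interval=1.0, max_wait=30.0, sleep=time.sleep):
    """Poll read_count() until two consecutive reads agree or max_wait elapses.

    read_count is a hypothetical zero-argument callable returning the
    current row count for one link. Returns (count, stable), where stable
    is False if the count was still changing when the time ran out.
    """
    previous = read_count()
    waited = 0.0
    while waited < max_wait:
        sleep(interval)
        waited += interval
        current = read_count()
        if current == previous:
            return current, True
        previous = current
    return previous, False
```

The caller can then treat `stable == False` as a warning rather than aborting the job outright.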

Re: DSGetStageInfo giving inaccurate results

Posted: Thu Aug 16, 2018 9:35 am
by chulett
thompsonp wrote: In an effort to work around the problem during initial testing on v11.5 I added a sleep 10 to the subroutine and it hasn't happened since, but that's not a long-term solution.
I know, right? The long term solution might be a sleep 30. :wink:

Sorry, wish I had something more useful to add. Others may know the gory details of how it all works under the covers; all I seem to recall is that it gets them from one of the job-related hashed files in the project (or maybe XMETA now), but don't quote me on that.

Posted: Thu Aug 16, 2018 9:43 pm
by ray.wurlod
Check that the 11.5 version is not returning a dynamic array of row counts, with one element per node.

If it is, you could apply a SUM() function to the dynamic array.

Posted: Fri Aug 17, 2018 7:51 am
by thompsonp
With option DSJ.LINKROWCOUNT the count is returned as a single value.
If I change that to DSJ.INSTROWCOUNT I get a comma-separated list of counts, one for each partition.
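
As a cross-check, the per-partition list can be summed and compared with the single value. This is a small Python sketch of that arithmetic only (not DataStage BASIC); the function name is hypothetical, and it assumes the list is plain integers separated by commas, as described above.

```python
def sum_partition_counts(inst_row_counts):
    """Sum a comma-separated string of per-partition row counts.

    Assumes a value such as "2400,2350,2375,2375"; empty fields are
    ignored, so an empty string sums to 0.
    """
    fields = [field.strip() for field in inst_row_counts.split(",")]
    return sum(int(field) for field in fields if field)
```

If this sum differs from the single DSJ.LINKROWCOUNT value for the same link, that would suggest one or more partitions had not reported when the total was read.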

Does anyone know what these DataStage Basic functions are examining to get the results? Is it something I can check using some other mechanism?

I could see counts being sent to JobMonApp.log, but are the counts from there written elsewhere before the DS basic functions are able to retrieve them?

Thanks

Posted: Sun Aug 19, 2018 6:50 pm
by ray.wurlod
Historically the row counts were stored in the RT_STATUSnnn table for the job. There are separate records for each stage and link in that table, and the structure of the records has always been undocumented (and is different for job records, stage records and link records).

I don't know whether this storage mechanism is still the case in version 11.x.

Posted: Mon Sep 24, 2018 7:08 am
by thompsonp
The issue went away for a while but has now resurfaced in a test environment.
It's back with IBM support with all kinds of tracing and debugging enabled.