Dataset corruption, SIGSEGV while reading

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Dataset corruption, SIGSEGV while reading

Post by niremy »

Hello,

I'm facing an odd problem and need your enlightenment:
I have a job that fails to read a dataset with the following error:

Code: Select all

Event Id: 5834
Time    : Wed Feb 16 18:03:24 2011
Type    : FATAL
User    : ...
Message :
        DS_001,0: Unable to map file /.../dataset/node1/DS_001.ds...0000.0000.0000.7080.cf254e0d.0005.ce3515ca: Invalid argument
        The error occurred on Orchestrate node node1 (hostname ...)
Event Id: 5835
Time    : Wed Feb 16 18:03:24 2011
Type    : FATAL
User    : ...
Message :
        DS_001,1: Unable to map file /.../dataset/node2/DS_001.ds...0000.0001.0000.7080.cf254e0d.0006.9934fd0a: Invalid argument
        The error occurred on Orchestrate node node2 (hostname ...)
Event Id: 5836
Time    : Wed Feb 16 18:03:25 2011
Type    : WARNING
User    : ...
Message :
        DS_001,0: /bin/echo: write error: Broken pipe
Event Id: 5837
Time    : Wed Feb 16 18:03:25 2011
Type    : FATAL
User    : ...
Message :
        DS_001,1: Operator terminated abnormally: received signal SIGSEGV
Event Id: 5838
Time    : Wed Feb 16 18:03:30 2011
Type    : FATAL
User    : ...
Message :
        DS_001,0: Operator terminated abnormally: received signal SIGSEGV
I checked disk space during execution and nothing seemed to be consuming much space on the disks.

The source file is 84 lines long and weight 20K.

I tried running the same job with the same file on my test server and everything ran smoothly.

I tried rerunning the job several times with the same file, but each time it fails with the very same error.

I also searched this forum and couldn't find any clue to the source of my problem.

Thanks in advance for any remarks that could lead me to a resolution of this issue :wink:
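As an aside, "Unable to map file …: Invalid argument" is the wording you typically get when a memory-mapping call fails with EINVAL, and one plausible trigger is a truncated or zero-length segment file (mapping a whole file of length 0 is rejected by POSIX `mmap()`). Below is a minimal sketch of how such damaged segment files could be hunted down; the `/tmp/ds_check/...` layout and the `DS_001.seg` file names are purely hypothetical stand-ins for the real `resource disk` directories.

```shell
#!/bin/sh
# Sketch: flag zero-length dataset segment files under the "resource disk"
# directories. Paths and file names here are made up for demonstration;
# substitute the directories from your own APT_CONFIG_FILE.
DATASET_DIRS="/tmp/ds_check/node1 /tmp/ds_check/node2"

# Build a throwaway layout: one good segment and one empty one.
mkdir -p /tmp/ds_check/node1 /tmp/ds_check/node2
printf 'data' > /tmp/ds_check/node1/DS_001.seg
: > /tmp/ds_check/node2/DS_001.seg   # zero-length: mapping len 0 fails with EINVAL

# -size 0c matches files of exactly 0 bytes.
find $DATASET_DIRS -type f -size 0c -print
# prints /tmp/ds_check/node2/DS_001.seg
```

Any file this prints would be a strong corruption suspect worth comparing against the sizes recorded in the dataset descriptor.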
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

What is the meaning of 'weight 20K' :)
gssr
Participant
Posts: 243
Joined: Fri Jan 09, 2009 12:51 am
Location: India

Post by gssr »

The dataset was not properly loaded. Check the job that creates the Dataset
RAJ
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

Sreenivasulu wrote:What is the meaning of 'weight 20K' :)
20 KBytes
It was to prevent the response "The file is too big" :wink:
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

gssr wrote:The dataset was not properly loaded. Check the job that creates the Dataset
How come the job works perfectly on another server?

I've already checked it multiple times and it doesn't differ from my other dataset-creation jobs :(
Vidyut
Participant
Posts: 24
Joined: Wed Oct 13, 2010 12:45 am

Post by Vidyut »

Are you using the same dataset created in your Test Environment?
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

Vidyut wrote:Are you using the same dataset created in your Test Environment?
In fact I have a job sequence that runs a first job, which creates the dataset from the flat file, and then a second job, which reads the dataset.

The dataset is clearly corrupted on one of the servers, as even the orchadmin dump command fails to read it properly.

I'm puzzled because the creation of the dataset produces no warnings :(
devesh_ssingh
Participant
Posts: 148
Joined: Thu Apr 10, 2008 12:47 am

Post by devesh_ssingh »

Check the environment in which you are reading it.
Since a dataset is partitioned, it won't work across two different environments unless the configuration file is the same for both.

I mean, if you read a dataset created on an 8-node configuration server on a 4-node server, it won't work.

For that you should create a new dataset on the 4-node server.
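The node-count comparison suggested above can be scripted in a couple of lines. This is only a sketch: the two `.apt` files below are made-up miniature examples, and it assumes the standard `node "name"` block syntax of parallel configuration files.

```shell
#!/bin/sh
# Sketch: compare the number of node definitions in the writer's and
# reader's APT config files. The file contents are fabricated examples.
cat > /tmp/writer.apt <<'EOF'
{
    node "node1" { fastname "hostA" pools "" }
    node "node2" { fastname "hostA" pools "" }
}
EOF
cat > /tmp/reader.apt <<'EOF'
{
    node "node1" { fastname "hostB" pools "" }
}
EOF

# Count lines that open a node definition.
w=$(grep -c '^[[:space:]]*node "' /tmp/writer.apt)
r=$(grep -c '^[[:space:]]*node "' /tmp/reader.apt)
echo "writer: $w nodes, reader: $r nodes"
[ "$w" -eq "$r" ] || echo "node counts differ"
```

If the two counts match (as they turn out to in this thread), the mismatch theory can be ruled out quickly.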
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

devesh_ssingh wrote:Check the environment in which you are reading it.
Since a dataset is partitioned, it won't work across two different environments unless the configuration file is the same for both.

I mean, if you read a dataset created on an 8-node configuration server on a 4-node server, it won't work.

For that you should create a new dataset on the 4-node server.
Thanks for the hint ...

For the job creating the dataset:

Code: Select all

Environment variable settings: 
APT_CONFIG_FILE=/app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
For the job reading the dataset:

Code: Select all

Environment variable settings: 
APT_CONFIG_FILE=/app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
So no luck ...
I forgot to mention that the very same jobs work fine with a tiny file of 2 or 3 lines.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Show us the content of your APT file.

I'd be interested to see whether your data segment paths are valid on the server you are executing on.

Also, do you have proper read/write authority to that path?
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

PaulVL wrote:Show us the content of your APT file.

I'd be interested to see whether your data segment paths are valid on the server you are executing on.

Also, do you have proper read/write authority to that path?
Here is the content of the APT_CONFIG_FILE:

Code: Select all

 cat /app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
{
        node "node1"
        {
                fastname "slxd2003.app.eiffage.loc"
                pools ""
                resource disk "/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node1" {pools ""}
                resource scratchdisk "/app/EQOPIGL/ISF/Files/EQOPIGL1/scratch/node1" {pools ""}
        }
        node "node2"
        {
                fastname "slxd2003.app.eiffage.loc"
                pools ""
                resource disk "/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node2" {pools ""}
                resource scratchdisk "/app/EQOPIGL/ISF/Files/EQOPIGL1/scratch/node2" {pools ""}
        }
}

Code: Select all

tree -dpugfDi /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset
/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset
[drwxrwxr-x eqopigl1 eqopigl1 Feb 18 15:05]  /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node1
[drwxrwxr-x eqopigl1 eqopigl1 Feb 18 15:05]  /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node2
And my user is eqopigl1 of course.

As a reminder, with the same APT_CONFIG_FILE and the same job, a very small input file runs flawlessly.
I'm thinking more of a server misconfiguration, whereas you seem to suspect bad job design :wink:

Nevertheless, I appreciate your efforts in helping me find the problem :)
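The path and permission checks PaulVL asked about can also be automated. The sketch below pulls the `resource disk` entries out of a config file with `sed` and tests each one for existence and writability; the config content and the `/tmp/apt_check/...` paths are stand-ins for the real file, and one directory is deliberately left missing to show the failure case.

```shell
#!/bin/sh
# Sketch: extract "resource disk" paths from an APT config file and
# verify each exists and is writable by the current user.
# The config below is a fabricated example, not the real one.
cat > /tmp/apt_check.apt <<'EOF'
{
    node "node1" { resource disk "/tmp/apt_check/node1" {pools ""} }
    node "node2" { resource disk "/tmp/apt_check/node2" {pools ""} }
}
EOF
mkdir -p /tmp/apt_check/node1          # node2 deliberately missing

sed -n 's/.*resource disk "\([^"]*\)".*/\1/p' /tmp/apt_check.apt |
while read -r path; do
    if [ -d "$path" ] && [ -w "$path" ]; then
        echo "OK      $path"
    else
        echo "MISSING $path"
    fi
done
# prints:
#   OK      /tmp/apt_check/node1
#   MISSING /tmp/apt_check/node2
```

In this thread both directories exist with `rwx` for the owning user, so (as the `tree` output above already shows) permissions look clean; a script like this just makes the check repeatable.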
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

May I ask for some more comments?
I'm stuck with this problem and can't see any solution ... :?
kshah9
Participant
Posts: 7
Joined: Wed Oct 06, 2010 11:32 am
Location: Pune

Post by kshah9 »

Hey buddy,

Just contact your ADMIN team once. I can see the error "DS_001,0: /bin/echo: write error: Broken pipe"; I have faced the same issue, and contacting the DS-ADMIN (Server Team) resolved it. So just a suggestion: contact the ADMIN team and mention the error message.

Not sure whether it will resolve the problem, but you can try.

Regards,
Kunal shah
niremy
Participant
Posts: 23
Joined: Tue Sep 22, 2009 3:17 am

Post by niremy »

kshah9 wrote: Just contact your ADMIN team once. I can see the error "DS_001,0: /bin/echo: write error: Broken pipe"; I have faced the same issue, and contacting the DS-ADMIN (Server Team) resolved it. So just a suggestion: contact the ADMIN team and mention the error message.
Thanks, but I'm posting here on behalf of my admin team; we have the same level of knowledge on this issue :roll:
So again, any tips will help :wink:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Have you involved your official support provider yet?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply