How to set up DataStage for an MPP system



dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

How to set up DataStage for an MPP system

Post by dsusr »

Hi All,

We have three Linux servers, each with 4 CPUs. DataStage PX is installed on all of them. I want to connect these servers and use them together for parallel processing.

From the Manager Guide I understand that to prepare DataStage to work on an MPP system we only need to change the configuration file. I just want to know what type of connectivity needs to be open between the servers, since as of now only ssh is enabled.

Do we need to have telnet enabled between these systems for connectivity?

Thanks
dsusr
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You don't need telnet, but you do need TCP/IP, and the ability for processes to communicate via sockets. You should use as fast a network connection as you can get, not less than 1 Gbit/s imho.

You didn't actually need the full DataStage PX installed on each, unless you want to run separate server jobs on each or use the BASIC Transformer stage or server Shared Containers on all nodes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
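
Editor's note: the requirement above (plain TCP/IP socket communication between the machines) can be spot-checked with a few lines of Python. This is only a minimal sketch and not part of DataStage. The function name can_connect is made up for illustration, and the host and port you pass in are whatever you choose to test; the check only succeeds if some process is actually listening on that port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only; it says nothing about
    whether DataStage itself is configured correctly.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name resolution failures.
        return False
```

A False result does not distinguish between a blocked network path and a port with no listener, so it is a first check rather than a diagnosis.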
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Hi,

I have modified the configuration file and added one node that represents the second server. But when I run my job I get the following error:

main_program: The section leader on <server2> died

main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.


I have also checked rsh, and it is properly installed on the system.

Thanks,
dsusr
Eric
Participant
Posts: 254
Joined: Mon Sep 29, 2003 4:35 am

Post by Eric »

1) Have you installed DataStage PX into the same path on all machines?
2) Have you tested rsh using the machine names in your APT_CONFIG file?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Read the error message carefully, then read the Install and Upgrade Guide and complete any steps you missed (such as permitting processes on one machine to execute on another as "trusted").

You really should test configuration files before attempting to use them. There is a test utility in the Configuration File editor in Manager.
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Hi,

Yes, I have installed DataStage on both servers at the same location. I have also tested rsh on both nodes by running the commands
rsh server1name uptime
rsh server2name uptime
and both give the correct result.

My configuration file is as follows:

{
    node "node1"
    {
        fastname "PUN020"
        pools ""
        resource disk "/opt/datastage/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/opt/datastage/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "PUN040"
        pools ""
        resource disk "/home/dsadm" {pools ""}
        resource scratchdisk "/Scratch" {pools ""}
    }
}
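
Editor's note: when checking rsh or name resolution it helps to use exactly the fastname values from the configuration file rather than retyping them. The sketch below pulls the node-to-fastname mapping out of a file laid out like the one above. It is a regex-based illustration, not the parser DataStage uses, and it assumes each node's closing brace sits on a line of its own. The name parse_nodes is hypothetical.

```python
import re

# Matches: node "name" { ...body... } where the closing brace of the node
# is on its own line (inline braces such as {pools ""} are skipped over).
NODE_RE = re.compile(r'\bnode\s+"([^"]+)"\s*\{(.*?)\n\s*\}', re.DOTALL)
FASTNAME_RE = re.compile(r'fastname\s+"([^"]+)"')


def parse_nodes(config_text):
    """Return {node name: fastname} for each node block found in the text.

    A node without a fastname entry maps to None.
    """
    nodes = {}
    for name, body in NODE_RE.findall(config_text):
        match = FASTNAME_RE.search(body)
        nodes[name] = match.group(1) if match else None
    return nodes
```

For the file above this would give node1 mapped to PUN020 and node2 mapped to PUN040, and those host names are the ones to feed into the rsh and hosts-file checks.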

When I test this configuration file using the check utility from Manager, it gives the following error:

##E TFIO 000211 14:49:16(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (127.0.0.1), TCP port number: 11001', RETRYING connect()

##E TFIO 000211 14:49:16(001) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (127.0.0.1), TCP port number: 11001', RETRYING connect()

##F TFIO 000112 14:49:16(002) <APT_RealFileExportOperator in APT_FileExportOperator,0> Fatal Error: APT_Communicator::pmSendPartitionInfo() failed on node PUN020 for partition 0 of dataset 0 (write failed to handle 14) Bad file descriptor

##E TFPM 000192 14:49:16(000) <node_node1> Player 2 terminated unexpectedly.

##E TFPM 000338 14:49:16(004) <main_program> Unexpected exit status 1

##E TFSR 000011 14:49:21(000) <main_program> Step execution finished with status = FAILED.

##E TOCK 000000 14:49:21(001) <main_program> ERROR: check configuration file failed.



One important point to note is that this configuration file is on the PUN020 server, and the check fails while trying to use that server's own node.

Please let me know if I need to make any other changes.

Thanks
dsusr
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Does anyone have any idea what this APT_Communicator error means? Also, is there any documentation for setting up an MPP system?

Thanks
dsusr
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

"Connection refused" usually indicates that the two machines are not in a trusted relationship. Have you made entries in the appropriate files, such as lmhosts, to enable this?
daniel0623
Charter Member
Posts: 34
Joined: Tue May 31, 2005 8:17 pm
Location: ShangHai,China

Post by daniel0623 »

First, make sure the rsh service has been started.
Second, add a .rhosts file to the dsadm home directory on each machine, and add the user dsadm to that .rhosts file.
Third, test whether rsh connects.
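
Editor's note: the second step can be verified mechanically. The sketch below reads the text of a .rhosts file and reports which of the expected hosts have no entry for the user. It is a simplified illustration: it assumes the common "hostname username" line layout (a line with only a host name is treated as trusting the same-named user) and ignores wildcard entries such as "+". The function name missing_rhosts_entries is made up for this example.

```python
def missing_rhosts_entries(rhosts_text, hosts, user="dsadm"):
    """Return the hosts from `hosts` that .rhosts does not trust for `user`.

    Simplified: only 'host' and 'host user' lines are understood.
    """
    trusted = set()
    for line in rhosts_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # A bare host line applies to the same-named user on that host.
        if len(fields) == 1 or fields[1] == user:
            trusted.add(fields[0].lower())
    return [h for h in hosts if h.lower() not in trusted]
```

An empty result means every listed host has an entry; it does not prove that the rsh service is running or that the file permissions are acceptable, so the third step is still needed.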
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

daniel0623 wrote: First, make sure the rsh service has been started.
Second, add a .rhosts file to the dsadm home directory on each machine, and add the user dsadm to that .rhosts file.
Third, test whether rsh connects.

Hi daniel/Ray,

Pardon me for replying late. Yes, I have tested rsh from both servers, and I am even able to log in to either server using rsh. I have tried this using the dsadm user ID only.

The issue is that the server from which I am running the job is not identifying the node name of its own node. If I test a configuration file containing only that single node, it does not give any error.

Thanks & Regards
dsusr
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Have you checked the hosts files on all systems? How exactly is name to IP address resolution performed?
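
Editor's note: the connect() errors posted earlier show "Host: PUN020 (127.0.0.1)", so one concrete way to answer the resolution question is to ask the operating system what a host name resolves to and whether that address is a loopback address. The following is a minimal sketch using only the Python standard library; resolves_to_loopback is a hypothetical name, and it looks at the first IPv4 address only.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname):
    """Return (address, is_loopback) for the IPv4 address of hostname.

    Uses the system resolver, so /etc/hosts entries are honoured
    according to the local name service configuration.
    """
    address = socket.gethostbyname(hostname)
    return address, ipaddress.ip_address(address).is_loopback
```

Run on each machine with each fastname from the configuration file. If a machine reports a loopback address for its own name, other nodes are likely to be told to connect back to 127.0.0.1, which would be consistent with the "Connection refused" messages in this thread.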
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

ray.wurlod wrote: Have you checked the hosts files on all systems? How exactly is name to IP address resolution performed?

Hi Ray,

Yes, in the hosts file the host name is mapped to the correct IP address.

If I log in to the server using rsh and the hostname, the login succeeds. Also, the error occurs for the server's own node, the same node that works fine with the other config file.


Thanks & Regards
dsusr
chenxs
Participant
Posts: 30
Joined: Mon Dec 27, 2004 3:11 am

hi

Post by chenxs »

Hi, have you solved this problem?

We have hit this issue too; please tell me how you solved it.

Thanks a lot.
weela_lee
Participant
Posts: 2
Joined: Tue Dec 28, 2004 8:00 pm

Post by weela_lee »

I met the same problem today. After I made the following settings, it works!
1. Make the user ID of dsadm and the ID of its group the same on all machines in the cluster.
2. Set up not only the .rhosts file in the home directory but also /etc/hosts, with entries for all machines in the cluster.

Hope it helps :D
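
Editor's note: the first setting can be compared across machines without logging in interactively, for example by collecting the contents of /etc/passwd from each host and comparing the numeric IDs. The sketch below assumes the standard colon-separated passwd layout (name, password, UID, GID, ...). The function names and the way the text is collected are hypothetical, and sites that keep accounts in NIS or LDAP would need a different source.

```python
def passwd_ids(passwd_text, user="dsadm"):
    """Return (uid, gid) for `user` from passwd-format text, or None."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[2]), int(fields[3])
    return None


def ids_match(passwd_by_host, user="dsadm"):
    """Compare the user's (uid, gid) across hosts.

    passwd_by_host maps a host name to that host's passwd text.
    Returns (all_equal, {host: (uid, gid) or None}).
    """
    ids = {host: passwd_ids(text, user) for host, text in passwd_by_host.items()}
    distinct = set(ids.values())
    return (len(distinct) == 1 and None not in distinct), ids
```

The second return value shows which host differs when the first is False.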
ib_icf
Participant
Posts: 8
Joined: Thu Feb 09, 2006 8:16 am
Location: hd

Post by ib_icf »

I met the same problem and have resolved it now:
##E TFIO 000211 14:49:16(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (127.0.0.1), TCP port number: 11001', RETRYING connect()
This problem is caused by a wrong IP address configuration. According to the error message above, there must be a line in /etc/hosts like this:


127.0.0.1 PUN020 localhost 
Change it to:


127.0.0.1 localhost 
realip PUN020
Here realip stands for the machine's actual network IP address. Hope this helps.
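
Editor's note: the condition described in this post (the machine's own host name listed on the 127.0.0.1 line) can be detected by scanning the hosts file text. The sketch below is illustrative only. The function name loopback_aliases is made up, and treating any name containing "localhost" or "loopback" as legitimate is a simplifying assumption.

```python
import ipaddress


def loopback_aliases(hosts_text):
    """Return host names mapped to a loopback address in hosts-file text,
    excluding the usual localhost/loopback aliases."""
    flagged = []
    for line in hosts_text.splitlines():
        # Drop comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a plain IP address; ignore the line
        if address.is_loopback:
            flagged += [
                name for name in fields[1:]
                if "localhost" not in name and "loopback" not in name
            ]
    return flagged
```

For the problem line quoted in this post the function returns the name PUN020, and for a corrected file where PUN020 sits on its own line with a non-loopback address it returns an empty list.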