Configuration file issue in GRID environment

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Configuration file issue in GRID environment

Post by arvind_ds »

Hello Experts,

I have a DataStage parallel job which reads from a DB2 table and writes the data to a dataset. The job runs in a GRID environment that uses Load Leveler as the workload management tool (resource manager). The job runs perfectly fine, and below is the configuration file created at run time when we trigger the job through Load Leveler with the GRID environment variables shown below. The job utilizes both compute nodes, "dev_compute_01" and "dev_compute_02".

$APT_GRID_PARTITIONS=2
$APT_GRID_ENABLE=YES
$APT_GRID_COMPUTENODES=2
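
(As a minimal, illustrative sketch: these APT_GRID_* settings are DataStage environment variables; they are shown below as shell exports purely to document what each one requests, with the names and values taken from above.)

export APT_GRID_ENABLE=YES        # hand node allocation to the resource manager at run time
export APT_GRID_COMPUTENODES=2    # ask for two distinct compute hosts
export APT_GRID_PARTITIONS=2      # two logical partitions per host, four compute entries total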


main_program: APT configuration file: /opt/IBM/LoadLjobdir/dsadm/141216/DataSet_Job/DataSet_Job_5872.config.apt
{
    node "Conductor"
    {
        fastname "dev"
        pools "conductor"
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute1"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute2"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute3"
    {
        fastname "dev_compute_02"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute4"
    {
        fastname "dev_compute_02"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
}


Now we have replaced Load Leveler with IBM LSF. When we run the same job it finishes successfully, BUT it uses only one compute node; below is the configuration file created at run time when we trigger this job through LSF with the same GRID environment variables. The job uses only compute node "dev_compute_01" and never uses the second compute node "dev_compute_02", so all of the load lands on a single compute node. This is happening with all parallel jobs.

$APT_GRID_PARTITIONS=2
$APT_GRID_ENABLE=YES
$APT_GRID_COMPUTENODES=2


main_program: APT configuration file: /opt/IBM/LSFjobdir/dsadm/141216/DataSet_Job/DataSet_Job_6443.config.apt
{
    node "Conductor"
    {
        fastname "dev"
        pools "conductor"
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute1"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute2"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute3"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
    node "Compute4"
    {
        fastname "dev_compute_01"
        pools ""
        resource disk "/opt/IBM/Datasets" {pools ""}
        resource scratchdisk "/var/scratch" {pools ""}
    }
}
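
(A quick, hedged way to compare the two generated files is to count logical nodes per physical host; the paths are taken from the job logs above.)

grep fastname /opt/IBM/LoadLjobdir/dsadm/141216/DataSet_Job/DataSet_Job_5872.config.apt | sort | uniq -c
#   1 fastname "dev"               <- conductor
#   2 fastname "dev_compute_01"    <- two partitions per host,
#   2 fastname "dev_compute_02"       spread over two hosts

grep fastname /opt/IBM/LSFjobdir/dsadm/141216/DataSet_Job/DataSet_Job_6443.config.apt | sort | uniq -c
#   1 fastname "dev"
#   4 fastname "dev_compute_01"    <- all four partitions on one host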


Any help on the resolution of this issue is much appreciated.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

How do you wish to utilize APT_GRID_COMPUTENODES? If you ALWAYS want to span across different servers when you specify more than one, you can add that to your queue definition.


example:

Begin Queue
QUEUE_NAME = DS_ETL_NORMAL
PRIORITY = 80
USERS = all
INTERACTIVE = NO
RES_REQ = "span[ptile=1] rusage[ut=0.1:duration=1]"
ut = 0.85
DESCRIPTION = Default priority data stage queue
End Queue



That's what I have.

The span part forces each new APT_GRID_COMPUTENODES request onto a unique hostname. The rusage and ut parts are fancier: they say don't send work to a host if it's over 85% utilized, and for each job you send, reserve 10% of the box, then decrement by 1% each second. It's there to help prevent oversaturation, because a DataStage job may not have ramped up to speed yet before your next job dispatch.
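
(A hedged illustration of the span behaviour outside of DataStage; the queue is the one defined above, sleep is just a stand-in workload, and <JOBID> is a placeholder.)

# span[ptile=1] allows at most one of the requested job slots per host,
# so a request for 2 slots must be dispatched to two distinct hosts:
bsub -q DS_ETL_NORMAL -n 2 -R "span[ptile=1]" sleep 60
bjobs <JOBID>    # the EXEC_HOST column should list two different hostnames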



Some places don't use APT_GRID_COMPUTENODES to govern their host selection. They prefer going X hosts by 1 partition, instead of N hosts by Y partitions.
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

I want APT_GRID_COMPUTENODES to span across different servers (compute nodes, I mean) when the value of this variable is greater than 1. I tried your suggestion by adding the span lines to my queues, BUT it didn't work. The same problem persists even after adding the span lines in the queue.

Any other suggestions on how to get rid of this issue? Thanks for your inputs.
Arvind
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

The span will work.

Did you do a "badmin reconfig" command after editing the lsb.queues file?
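
(For reference, a minimal sketch of the reconfigure-and-verify sequence, assuming the queue definition shown earlier:)

badmin ckconfig             # sanity-check the edited lsb.queues first
badmin reconfig             # re-read the configuration so the span change takes effect
bqueues -l DS_ETL_NORMAL    # the RES_REQ line should now show span[ptile=1]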
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Thank you so much PaulVL. It worked after running "badmin reconfig".
Arvind