Hello Experts,
I have a DataStage parallel job that reads from a DB2 table and writes the data to a dataset. The job runs in a GRID environment that uses Load Leveler as the workload management tool (resource manager). The job runs perfectly fine, and below is the configuration file created at run time when we trigger the job through Load Leveler using the GRID environment variables shown. The job utilizes both compute nodes, "dev_compute_01" and "dev_compute_02".
$APT_GRID_PARTITIONS=2
$APT_GRID_ENABLE=YES
$APT_GRID_COMPUTENODES=2
main_program: APT configuration file: /opt/IBM/LoadLjobdir/dsadm/141216/DataSet_Job/DataSet_Job_5872.config.apt
{
node "Conductor"
{
fastname "dev"
pools "conductor"
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute1"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute2"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute3"
{
fastname "dev_compute_02"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute4"
{
fastname "dev_compute_02"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
}
Now we have replaced Load Leveler with IBM LSF, and when we run the same job it finishes successfully BUT uses only one compute node. Below is the configuration file created at run time when we trigger this job through LSF using the same GRID environment variables. The job uses only the compute node "dev_compute_01" and never uses the second compute node "dev_compute_02", putting the entire load on a single compute node. This is happening with all parallel jobs.
$APT_GRID_PARTITIONS=2
$APT_GRID_ENABLE=YES
$APT_GRID_COMPUTENODES=2
main_program: APT configuration file: /opt/IBM/LSFjobdir/dsadm/141216/DataSet_Job/DataSet_Job_6443.config.apt
{
node "Conductor"
{
fastname "dev"
pools "conductor"
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute1"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute2"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute3"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
node "Compute4"
{
fastname "dev_compute_01"
pools ""
resource disk "/opt/IBM/Datasets" {pools ""}
resource scratchdisk "/var/scratch" {pools ""}
}
}
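To sanity-check which hosts a generated configuration file actually targets, you can count the distinct fastname values among the compute nodes (the ones with the default pool, pools ""). A minimal sketch in Python, assuming the one-entry-per-line layout shown in the dumps above:

```python
import re

def compute_hosts(config_text):
    """Return the set of distinct compute-node hosts in an APT config file.

    Tracks each node's fastname, then keeps it only if the node's own
    pools entry is the default pool (""); the conductor node, whose pool
    is "conductor", and the per-resource {pools ""} clauses are skipped.
    """
    hosts = set()
    current_fastname = None
    for line in config_text.splitlines():
        m = re.search(r'fastname\s+"([^"]+)"', line)
        if m:
            current_fastname = m.group(1)
            continue
        m = re.search(r'pools\s+"([^"]*)"', line)
        if m and current_fastname is not None:
            if m.group(1) == "":
                hosts.add(current_fastname)
            # Reset so the resource disk/scratchdisk {pools ""} lines
            # that follow in the same node block are not counted.
            current_fastname = None
    return hosts
```

Running this against the Load Leveler configuration should report two hosts, while the LSF-generated one reports only dev_compute_01, which makes the difference easy to spot in logs.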
Any help on resolving this issue is much appreciated.
How do you wish to utilize APT_GRID_COMPUTENODE? If you ALWAYS want to span across a different server when you specify more than one, then you can add that to your queue definition.
example:
Begin Queue
QUEUE_NAME = DS_ETL_NORMAL
PRIORITY = 80
USERS = all
INTERACTIVE = NO
RES_REQ = "span[ptile=1] rusage[ut=0.1:duration=1]"
ut = 0.85
DESCRIPTION = Default priority data stage queue
End Queue
that's what I have.
The span part forces each new APT_GRID_COMPUTENODE request onto a unique hostname. The rusage and ut parts are fancier: they say don't send work to a host if it's over 85% utilized, and for each job you send, reserve 10% of the box, then decrement that by 1% each second. This helps prevent oversaturation, because a DataStage job may not have ramped up to speed yet before your next job dispatches.
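As an aside, the same resource-requirement string can also be supplied on an individual job submission instead of being baked into the queue; a sketch (the queue name matches the definition above, but the submitted script name is just a placeholder):

```shell
# Ask LSF for 2 slots, forcing each slot onto a distinct host
# (span[ptile=1]) and applying the same rusage reservation as the queue.
bsub -q DS_ETL_NORMAL -n 2 \
     -R "span[ptile=1] rusage[ut=0.1:duration=1]" \
     ./run_datastage_job.sh
```

This can be handy for testing whether the span requirement itself behaves as expected before changing the shared queue definition.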
Some places don't use APT_GRID_COMPUTENODES to govern their host selection. They like going X by 1, instead of N by Y.
I want APT_GRID_COMPUTENODE to span across different servers (compute nodes, I mean) when the value of this variable is greater than 1. I tried your suggestion by adding the span lines to my queues, BUT it didn't work. The same problem persists even after adding the span lines to the queue.
Any other suggestions on how to get rid of this issue? Thanks for your inputs.
Arvind