Suggestions on APT_CONFIG_FILE

SachinCho
Participant
Posts: 45
Joined: Thu Jan 14, 2010 1:23 am
Location: Pune

Suggestions on APT_CONFIG_FILE

Post by SachinCho »

Hi All,
We have a DS project running on v9.1 with the configuration file below:

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/Dataset" {pools ""}
                resource scratchdisk "/data/Temp" {pools ""}
        }
}
Currently both the resource disk and the resource scratchdisk sit on the same mount, /data, which has around 1.5 TB allocated to it.
To improve I/O and avoid dependence on a single file system, we have created the following directories:
/data/scratch01
/data/scratch02
/data/dataset01
/data/dataset02

Each of these directories is 400 GB in size and sits on a separate file system underneath.

We plan to have two configuration files going forward: a single-node and a two-node, and we would like to use these disks efficiently. We have DB2 as the target database, and there are heavy sort operations involved as well.

A few questions:
1. Is it a good strategy to split the scratchdisk and resource disk like this to get better I/O?
2. Any pointers or specific suggestions on the arrangement of the new config files?

Thanks in advance
Sachin C
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

First and foremost, jobs developed and tested on a single-node APT config very often suffer from partitioning issues, so increasing the number of nodes later on is challenging. It is usually a good idea to use a single-node APT config only in unavoidable cases, e.g. ISD jobs.

A1: Yes.

A2: See the Configuration File section (page 14 onwards) of this document.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Single-node APT files are perfect for jobs that simply trigger stored procedures on a database or perform database-only SQL with no data transfer to/from DataStage.
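A minimal single-node file for such database-only jobs might look like this (a sketch only, reusing the fastname and the new mounts from the original post; one dataset and one scratch file system is enough when little or no data lands on the engine):

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
        }
}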

Recommended 2-node APT layout:

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource disk "/data/dataset02" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
                resource scratchdisk "/data/scratch02/Temp" {pools ""}
        }
        node "node2"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource disk "/data/dataset02" {pools ""}
                resource scratchdisk "/data/scratch02/Temp" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
        }
} 

node2 should hit scratch02 first and spill over into scratch01 if needed.
The reason you want a /Temp sub-directory is that you will most likely be using your scratch disk for other purposes, such as TEMPDIR:

TEMPDIR=/data/scratch01/TEMPDIR

You don't want that falling back to the default /tmp.

If you had multiple projects, you could also include the project name in your scratch-disk sub-directories to make cleanup easier and to identify whodunit situations.
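For example (a sketch; "proj1" is a hypothetical project name), the scratchdisk entries in each node definition could point at per-project sub-directories:

Code: Select all

                resource scratchdisk "/data/scratch01/proj1/Temp" {pools ""}
                resource scratchdisk "/data/scratch02/proj1/Temp" {pools ""}

Anything left behind in /data/scratch01/proj1 is then immediately attributable to that project.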
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Slightly off-topic ... Parallel jobs with a single-node APT config file can certainly be used to trigger external routines. However, parallel jobs entail a certain amount of pre-processing/initialization, and I feel that Server jobs may be better suited for triggering external routines and performing utility-type tasks. See this discussion.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

True, but in a grid environment some admins restrict the use of server jobs because they do not get distributed and load-balanced onto the grid.

I've been in two shops now that have a policy of no new server jobs and of migrating the old server jobs to parallel jobs (simply for the distributed-workload aspect).
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For a grid environment, sure. Otherwise...
-craig

"You can never have too many knives" -- Logan Nine Fingers