Suggestions on APT_CONFIG_FILE

SachinCho
Participant
Posts: 45
Joined: Thu Jan 14, 2010 1:23 am
Location: Pune

Suggestions on APT_CONFIG_FILE

Post by SachinCho »

Hi All,
We have a DS project running on v9.1 with the configuration file below:

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/Dataset" {pools ""}
                resource scratchdisk "/data/Temp" {pools ""}
        }
}
Currently both the resource disk and the resource scratchdisk sit on the same mount, /data, which has around 1.5 TB allocated to it.
To improve I/O and avoid dependence on a single file system, we have created the following directories:
/data/scratch01
/data/scratch02
/data/dataset01
/data/dataset02

Each of these directories is 400 GB in size and sits on a separate file system underneath.

We plan to have two configuration files going forward: a single-node and a two-node, and we would like to use these disks efficiently. We have DB2 as the target database, and there are heavy sort operations involved as well.

A few questions:
1. Is it a good strategy to split the scratchdisk and resource disk like this to get better I/O?
2. Any pointers or specific suggestions on the arrangement of the new config files?

Thanks in advance
Sachin C
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

First and foremost, jobs developed and tested on a single-node APT config very often suffer from partitioning issues, so increasing the number of nodes later on is challenging. It is usually a good idea to use a single-node APT config only in unavoidable cases, e.g. ISD jobs.

A1: Yes.

A2: See the Configuration File section (page 14 onwards) of this document.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Single-node APT files are perfect for jobs that simply trigger stored procedures on a database or perform database-only SQL with no data transfer to/from DataStage.
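A minimal single-node file for such database-only jobs might look like this (a sketch only, reusing the fastname and the new mounts from the original post; one dataset and one scratch file system is enough when little or no data lands on the engine):

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
        }
}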

Recommended 2-node APT layout:

Code: Select all

{
        node "node1"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource disk "/data/dataset02" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
                resource scratchdisk "/data/scratch02/Temp" {pools ""}
        }
        node "node2"
        {
                fastname "server1"
                pools ""
                resource disk "/data/dataset01" {pools ""}
                resource disk "/data/dataset02" {pools ""}
                resource scratchdisk "/data/scratch02/Temp" {pools ""}
                resource scratchdisk "/data/scratch01/Temp" {pools ""}
        }
} 

node2 should hit scratch02 first and spill over into scratch01 if needed.
The reason you want a /Temp sub-directory is that you will most likely be using your scratch disk for other purposes, such as TEMPDIR:

TEMPDIR=/data/scratch01/TEMPDIR

You don't want that falling back to the default /tmp.

If you had multiple projects, you could also include the project name in your scratch-disk sub-directories to make cleanup easier and to identify whodunit situations.
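For example (a sketch; "proj1" is a hypothetical project name), the scratchdisk entries in each node definition could point at per-project sub-directories:

Code: Select all

                resource scratchdisk "/data/scratch01/proj1/Temp" {pools ""}
                resource scratchdisk "/data/scratch02/proj1/Temp" {pools ""}

Anything left behind in /data/scratch01/proj1 is then immediately attributable to that project.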
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Slightly off-topic ... Parallel jobs with a single-node APT config file can certainly be used to trigger external routines. However, parallel jobs entail a certain amount of pre-processing/initialization, and I feel that Server jobs may be better suited for triggering external routines and performing utility-type tasks. See this discussion.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

True, but in a grid environment some admins restrict the use of server jobs because they do not get distributed and load-balanced onto the grid.

I've been in two shops now that have a policy of no new server jobs and of migrating the old server jobs to parallel jobs (simply for the distributed-workload aspect).
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For a grid environment, sure. Otherwise...
-craig

"You can never have too many knives" -- Logan Nine Fingers