"ReferenceInputs" in Parallel Jobs

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

"ReferenceInputs" in Parallel Jobs

Post by ray.wurlod »

What would be really useful for folks making the transition from server jobs - maybe multi-instance server jobs - to parallel jobs is a quick summary or overview of the differences in design methodology: in particular, that parallel jobs work with data sets on links, and therefore "reference lookup" techniques (a) are different, (b) tend not to use hashed files, because the data sets are memory-resident, and (c) are supported by a number of dedicated stage types.
The manual, after all, is a reference manual; it does not really do a good job of explaining some of the "how to" aspects.
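For instance, the stage types that handle reference data in PX are Lookup, Join and Merge. In OSH terms a lookup looks roughly like the sketch below - this is from memory, and the exact operator options may well differ in your release:

Code:
    # input 0 is the driving stream; input 1 is held in memory as the reference table
    # (operator and option names are my recollection - check the Orchestrate docs)
    osh "lookup -key cust_id < stream.ds < reference.ds > out.ds"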
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Post by bigpoppa »

Ray,

You're right. PX doesn't have a real need for hashed files. Data flows through PX jobs in a much different way than it flows through DS Server jobs.

PX starts by loading as much data into memory as possible. Data is transferred between stages in buffers. When a buffer fills up, PX processes it and sends it downstream. So many stages may be working at the same time, and this is the idea behind 'pipeline' parallelism. PX also takes advantage of partitioned parallelism - which basically means dividing data into multiple streams and running copies of the same stage in parallel. PX only hits the disks when it has to handle more data than can fit in memory.
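A tiny OSH sketch of both ideas (operator names from memory - check them against your docs): the pipe gives you pipeline parallelism, and each operator also runs one copy per node in your configuration file, which is the partitioned parallelism.

Code:
    # generator produces rows, peek prints them; the two run concurrently,
    # and each runs on every node defined in $APT_CONFIG_FILE
    osh "generator -schema record(a:int32) | peek"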

One of the most important design rules is to limit sorts and hashes. Sorting and hashing are the most costly operations in PX because they almost always hit the disk. You can look in the temporary work space and watch it fill up with files during a sort or hash stage. However, you can control how much memory sorts and hashes use, thereby controlling how quickly PX will start going to disk.
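If you want to see this for yourself, something as simple as the loop below works - the scratch path is just an example, use whatever your config file points at:

Code:
    # watch the PX scratch area grow while a big sort runs
    while true; do du -sh /ds/scratch; sleep 5; done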

Ultimately, there are a ton of customizable env vars that help you control various memory and buffer settings. I remember seeing a list of 30 or so env vars in the PX admin guide, but I'm not sure where they are in the docs now. If you can't find them in the current docs, Ascential should be able to provide you with a list. Understanding these vars is key to building performant PX jobs.
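As an example of the kind of thing I mean (the names below are the ones I recall from the buffering section of the docs - double-check them and their defaults against your release before relying on them):

Code:
    # example settings only - verify names and values for your version
    export APT_CONFIG_FILE=/ds/configs/4node.apt      # which parallel config file to use
    export APT_BUFFER_MAXIMUM_MEMORY=6291456          # per-buffer memory, in bytes
    export APT_BUFFER_FREE_RUN=0.5                    # how full a buffer may get before back-pressure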

-
BP
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

bigpoppa wrote:Ray,
One of the most important design rules is to limit sorts and hashes. Sorting and hashing are the most costly operations in PX because they almost always hit the disk. You can look in the temporary work space and watch it fill up with files during a sort or hash stage. However, you can control how much memory sorts and hashes use, thereby controlling how quickly PX will start going to disk.
You mentioned the ability to control the amount of memory used for sorting. Can you point me to the correct ENV variables controlling that?

Thanks!

-T.J.
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Sort options

Post by bigpoppa »

T.J,

Firstly, PX used to automatically insert hashes and sorts into PX jobs. This was done to make it easier for people with no parallel experience to write performant parallel jobs. I'm not sure if PX still does this, but I recommend that experienced PX users turn off the automatic insertion of hashes/sorts. In the OSH scripting language, the osh command has '-nosortinsertion' and '-nopartinsertion' options. To get the GUI to run without the automatic inserts, I believe you need to add a new 'osh' command to the PX bin directory. This command would call the original osh command with the two options above.
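A wrapper along these lines is what I mean - purely illustrative, the path is an example and you would rename the shipped binary first. Later releases may also expose env vars such as APT_NO_SORT_INSERTION / APT_NO_PART_INSERTION that do the same thing, so check your docs before relying on either approach.

Code:
    #!/bin/sh
    # replacement 'osh' placed in the PX bin directory; osh.orig is the renamed original
    exec /px/bin/osh.orig -nosortinsertion -nopartinsertion "$@"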

The key to optimizing sort performance is to have enough memory and disk space to accommodate the amount of data you are putting through the sort. A PX sort will run faster if it does not have to write to the scratch disk. If you do not have enough memory to accommodate your data, then you can provide 'fast' disks for scratch space to accelerate your sorts.

To my knowledge, there are no env vars that control sort performance directly. However, you can influence it in the following three ways:

1. By specifying a sort node pool
2. By specifying a sort scratch disk pool
3. By upping the memory option for tsort

The sort node pool in the configuration file specifies the nodes that PX will use for all system resources except for disk. The scratch disk pool specifies the directories that tsort will use when it runs out of memory to perform the sort. In theory, configuration files for shared-everything systems should be much simpler than those for shared-nothing systems. In shared-nothing systems, PX users can take advantage of multiple, separate disks and memory resources for more performant sorts.
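Roughly, the relevant pieces of the configuration file look like this - the syntax is from memory and the paths and pool names are just examples. Here node1 belongs to the default pool and to a "sort" node pool, and a fast disk is reserved in a "sort" scratch disk pool:

Code:
    {
      node "node1"
      {
        fastname "etlhost"
        pools "" "sort"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/fastdisk/scratch" {pools "sort"}
      }
    }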

The memory option on tsort specifies the amount (in MB) of memory PX will use per node for the sort. The larger the value, the less often the sort will write to disk.
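In OSH terms it is just an option on the operator - the option name below is as I remember it, so verify it against your docs. I believe the Sort stage in the GUI exposes the same thing as a restrict-memory property.

Code:
    # give each sorting node a 64 MB in-memory buffer before it spills to scratch
    # (generator/peek are only there to make the sketch self-contained)
    osh "generator -schema record(cust_id:int32) | tsort -key cust_id -memory 64 | peek"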

Hopefully, from this discussion, you will come to the conclusion that you should really know the physical configuration of the machine(s) on which PX will be running. The input of a good sys admin will help you write a good PX config file that will lead to performant, scalable PX jobs!

- B.P.