Scratchdisk requirement


yakiku
Premium Member
Posts: 23
Joined: Thu May 13, 2004 7:14 am

Scratchdisk requirement

Post by yakiku »

Hi,

We are trying to calculate the required scratchdisk size for a new build, and one of the configuration suggestion documents states the following:

-------------------
The size of scratch/sort areas should be (X/N) + X*1.35 (or X*1.45)
where 'X' is the size of the largest data file
and 'N' is the number of ways that most jobs are expected to run in parallel

For example, a 4GB input file on an 8-way system: 4GB/8CPU + (4GB * 35%) = ~2GB for each scratch/sort space (.5 GB + 1.4 GB) per partition, 8 in this case.
-----------------------------

I notice that there is a discrepancy between the formula given and the example calculation. Does anyone know whether it should be X*0.35 or X*1.35? I wanted to check if anyone has run into this before.
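
A quick sanity check of the two readings (a minimal Python sketch; the 4GB file and 8-way figures are taken from the quoted example):

    # Size of the largest data file (GB) and degree of parallelism,
    # both from the quoted example.
    X, N = 4.0, 8

    per_partition_035 = X / N + X * 0.35   # 0.5 + 1.4 = 1.9, the ~2 GB in the example
    per_partition_135 = X / N + X * 1.35   # 0.5 + 5.4 = 5.9, matching nothing quoted

    print(per_partition_035, per_partition_135)

Only the 0.35 reading reproduces the example's per-partition figure.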

Also, in your experience, how did you decide on the size of the scratchdisk?

Thanks,
yakiku.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I would say the number intended is 0.35, but somehow that formula doesn't look correct to me - are the parentheses correct?

But that only takes one process into account. What if you had 10 concurrent jobs running with different data files that are all at the maximum size? That would fill up scratch quite quickly. In reality each site is different and the actual size is very dependent upon how the applications are designed and how they are run. The rough initial sizing guide might have to be adapted to actual conditions several times.

(X/N) in your case would give you the size used in each scratch area and then they add a 35% or 45% overhead which I suppose is reasonable enough.
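
One way to fold the concurrency concern into the estimate (a rough Python sketch; the helper name and the concurrent_jobs parameter are illustrative, not from any official sizing document):

    def scratch_estimate_gb(largest_file_gb, nodes, overhead=0.35, concurrent_jobs=1):
        # Rule-of-thumb figure from the quoted formula, scaled by how many
        # jobs of that size might spill to scratch at the same time.
        per_job = largest_file_gb / nodes + largest_file_gb * overhead
        return per_job * concurrent_jobs

    print(scratch_estimate_gb(4, 8))                      # one job:  1.9 GB
    print(scratch_estimate_gb(4, 8, concurrent_jobs=10))  # ten jobs: 19.0 GB

As the ten-job case shows, the concurrency factor dominates the estimate, which is why the initial sizing usually has to be revisited once the real workload is known.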
r.bhatia
Participant
Posts: 11
Joined: Mon Jun 30, 2008 12:45 am
Location: Manchester

Post by r.bhatia »

Job Description:
The job reads 3 input files, say 4 GB each, joins them using a Join stage, and then applies transformations before splitting the data into multiple output datasets.

I am using a 4-node config file (4 CPUs).

I plan to run as many as 6 such jobs in parallel; what would be the required scratch space?

Based on the empirical formula above:
= 6 * ( ((4*3)/4) + (4*3)*0.35 )
= 6 * (3 + 4.2)
= 6 * 7.2
~ 43 GB
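
A quick check of that arithmetic (minimal Python; the 12 GB is the three 4 GB inputs combined):

    jobs, nodes, data_gb, overhead = 6, 4, 4 * 3, 0.35
    per_job = data_gb / nodes + data_gb * overhead   # 3.0 + 4.2 = 7.2 GB
    print(jobs * per_job)                            # 43.2 GB in total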

Is this calculation correct for my scratchdisk requirements?

What would be the other space requirements that I need to take care of? (I know resource disk is one, but I am not sure how to estimate that...)
rakesh bhatia