dataset read problem

adasgupta123 · Post by **adasgupta123** » Wed Nov 08, 2006 1:49 am

Hi all,

I have been assigned to tune some parallel jobs.
I have observed that in some jobs, reading from a Data Set takes a very long time.

Please advise me how to increase the number of rows/sec when reading from a Data Set.

thanks and regards

Avik Dasgupta

balajisr · Post by **balajisr** » Wed Nov 08, 2006 3:34 am

How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.

tagnihotri · Post by **tagnihotri** » Wed Nov 08, 2006 7:13 am

Also look in the configuration file and check the filesystems and mounts for the directories listed there as scratch and resource disks.
Then we can talk.
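A minimal sketch of that check, assuming the parallel configuration file uses the usual `node "..." { ... resource disk "..." ... resource scratchdisk "..." }` layout. The node names, host name, and paths in `SAMPLE_CONFIG` are made up for illustration, and the helper names are hypothetical. The script only lists the directories per node and reports which ones are named by more than one node; whether sharing is actually a problem depends on the mounts behind those paths, which still have to be checked on the server itself.

```python
import re
from collections import defaultdict

# Hypothetical configuration text: node names, host name, and paths are invented.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds1" {pools ""}
    resource scratchdisk "/scratch/s1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds2" {pools ""}
    resource scratchdisk "/scratch/s1" {pools ""}
  }
}
'''

NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"')
RESOURCE_RE = re.compile(r'resource\s+(disk|scratchdisk)\s+"([^"]+)"')


def directories_by_node(config_text):
    """Map each node name to its resource disk and scratchdisk directories."""
    result = {}
    current = None
    for line in config_text.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            current = node_match.group(1)
            result[current] = {"disk": [], "scratchdisk": []}
            continue
        resource_match = RESOURCE_RE.search(line)
        if resource_match and current is not None:
            kind, path = resource_match.groups()
            result[current][kind].append(path)
    return result


def shared_directories(dirs_by_node):
    """Return the directories that more than one node points at."""
    users = defaultdict(set)
    for node, kinds in dirs_by_node.items():
        for paths in kinds.values():
            for path in paths:
                users[path].add(node)
    return {path: sorted(nodes) for path, nodes in users.items() if len(nodes) > 1}


if __name__ == "__main__":
    dirs = directories_by_node(SAMPLE_CONFIG)
    for node, kinds in dirs.items():
        print(node, kinds)
    print("shared:", shared_directories(dirs))
```

Each directory printed here can then be looked up with the operating system's own tools to see which filesystem and mount it lives on.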

ray.wurlod · Post by **ray.wurlod** » Wed Nov 08, 2006 1:08 pm

What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?

adasgupta123 · Post by **adasgupta123** » Wed Nov 08, 2006 11:40 pm

ray.wurlod wrote:What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?

I am very new to DataStage. I have developed some parallel jobs in the last two months.

adasgupta123 · Post by **adasgupta123** » Wed Nov 08, 2006 11:46 pm

balajisr wrote:How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.

Hi,

The partition count is 8, and I have checked the filesystem; the memory usage is OK.

tagnihotri · Post by **tagnihotri** » Thu Nov 09, 2006 12:00 am

It's not about usage! Try to find the mount. Also, when you say 8 nodes, are all 8 nodes being used well, and is the data in the Data Set well distributed? (Check the source for this.)
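One rough way to put a number on "well distributed" is to compare the largest partition with the average one. The sketch below assumes the per-partition row counts have already been collected by whatever means is available; the counts shown are invented, and `partition_skew` is a hypothetical helper, not part of any DataStage tooling.

```python
def partition_skew(row_counts):
    """Return (largest partition / mean partition, share of rows per partition)."""
    total = sum(row_counts)
    if total == 0:
        # An empty Data Set has no skew to speak of.
        return 1.0, [0.0] * len(row_counts)
    mean = total / len(row_counts)
    shares = [count / total for count in row_counts]
    return max(row_counts) / mean, shares


if __name__ == "__main__":
    # Invented row counts for an 8-partition Data Set.
    balanced = [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
    skewed = [6600, 200, 200, 200, 200, 200, 200, 200]

    for name, counts in (("balanced", balanced), ("skewed", skewed)):
        ratio, shares = partition_skew(counts)
        print(name, "max/mean =", round(ratio, 2),
              "largest share =", round(max(shares), 3))
```

A ratio close to 1 means the partitions hold similar amounts of data. A ratio well above 1 means one partition is doing most of the work while the others sit idle, so the extra nodes contribute little.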

adasgupta123 · Post by **adasgupta123** » Thu Nov 09, 2006 2:06 am

Hi,

I have checked the mount points. Data is well distributed across all 8 nodes. One thing I should mention is that the runtime column propagation option is enabled. Is it delaying the read process?

tagnihotri · Post by **tagnihotri** » Thu Nov 09, 2006 7:51 am

RCP should not affect performance. If the data is well distributed and the file mounts are proper (i.e. an individual filesystem mount for each node), then are you sure that the issue is in reading the Data Set?

The performance issue may be caused by some other processing you are doing in your job. How exactly did you pin the blame on the Data Set read? Can you elaborate, please?

adasgupta123 · Post by **adasgupta123** » Thu Nov 09, 2006 10:16 am

Basically, we are handling a huge amount of data every day (around 300 GB!), and it is getting larger every month.

In most of the jobs a Data Set is both the first stage and the final output stage, i.e. the output Data Set of one job acts as the input to the next job. The jobs mainly contain join and transformation stages. In some cases there are funnel and filter stages.

I suspect a Data Set read problem because on the output links of all the other stages the number of rows per second is much higher than on the Data Set's.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 09, 2006 1:08 pm

Etiquette Note
It is not necessary to overquote all previous replies - they're there in the thread. Also, using Quote severely restricts your ability to earn points.

tagnihotri · Post by **tagnihotri** » Thu Nov 09, 2006 11:45 pm

Ray, I will take note of this from here on! Thanks.

Adasgupta,
Can you please detail your job design?

ray.wurlod · Post by **ray.wurlod** » Fri Nov 10, 2006 12:46 am

Rows/sec is an almost completely meaningless metric. Various factors influence it, usually negatively, such as row width, network bottlenecks, the clock still running after all rows have been processed, and so on. I have posted before on this. There can be no such thing as an answer to the question "what is a typical rows/sec?". The main way to increase the read rate from a Data Set is to increase buffer sizes and not to have any slower stage types downstream of it. But sometimes you just have to. All else being equal, minimize the time taken by ensuring that rows are distributed equally across all partitions when the Data Set is populated.
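A small arithmetic sketch of two of those effects, row width and the clock that keeps running. All figures are invented for illustration, and the function names are hypothetical.

```python
def rows_per_sec(rows, elapsed_seconds):
    """Rows divided by wall-clock time, i.e. a rows/sec figure."""
    return rows / elapsed_seconds


def bytes_per_sec(rows, row_width_bytes, elapsed_seconds):
    """Actual data volume moved per second."""
    return rows * row_width_bytes / elapsed_seconds


if __name__ == "__main__":
    # Row width: link A carries wide rows, link B narrow ones, both for 100 s.
    a_rows, a_width = 1_000_000, 2000
    b_rows, b_width = 10_000_000, 100
    print("A rows/sec:", rows_per_sec(a_rows, 100),
          "bytes/sec:", bytes_per_sec(a_rows, a_width, 100))
    print("B rows/sec:", rows_per_sec(b_rows, 100),
          "bytes/sec:", bytes_per_sec(b_rows, b_width, 100))

    # Running clock: the same 1,000,000 rows are read in 50 s, but the
    # rate is computed over 200 s because a slower downstream stage keeps
    # the job running.
    print("while reading:", rows_per_sec(1_000_000, 50))
    print("over whole job:", rows_per_sec(1_000_000, 200))
```

In this made-up case link A shows a tenth of link B's rows/sec while moving twice as many bytes, and the read rate drops by a factor of four purely because of how long the rest of the job takes.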
