dataset read problem

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

adasgupta123
Participant
Posts: 42
Joined: Fri Oct 20, 2006 1:58 am

dataset read problem

Post by adasgupta123 »

Hi all,

I have been assigned to tune some parallel jobs.
In some of them, reading from a dataset takes a very long time.

Please advise how I can increase the number of rows/sec when reading from a dataset.

thanks and regards

Avik Dasgupta
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Re: dataset read problem

Post by tagnihotri »

Also look in the configuration file and check the filesystems and mounts for the directories listed there as scratch disk and resource disk.
Then we can talk.
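
A rough sketch of that check in Python follows. It assumes the usual parallel configuration file layout, in which each node block lists its storage as `resource disk "<path>"` and `resource scratchdisk "<path>"` entries. The node names, host name and paths in `SAMPLE_CONFIG` are made up for illustration, and the helper names are hypothetical. On a real system you would read the file that `$APT_CONFIG_FILE` points to instead.

```python
import os
import re

# Invented example of a two-node configuration file (names and paths are hypothetical).
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''

NODE_RE = re.compile(r'node\s+"([^"]+)"')
RESOURCE_RE = re.compile(r'resource\s+(disk|scratchdisk)\s+"([^"]+)"')


def resource_paths(config_text):
    """Return (node, kind, path) tuples for every disk/scratchdisk entry."""
    results = []
    current_node = None
    for line in config_text.splitlines():
        node = NODE_RE.search(line)
        if node:
            current_node = node.group(1)
        for kind, path in RESOURCE_RE.findall(line):
            results.append((current_node, kind, path))
    return results


def mount_point(path):
    """Walk up from path until a mount point is reached."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


if __name__ == "__main__":
    for node, kind, path in resource_paths(SAMPLE_CONFIG):
        print(f"{node:8} {kind:12} {path:20} mounted on {mount_point(path)}")
```

If several nodes report the same mount point for their resource or scratch directories, those partitions are competing for the same filesystem, which is the situation worth investigating first.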

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
adasgupta123
Participant
Posts: 42
Joined: Fri Oct 20, 2006 1:58 am

Datastage read problem

Post by adasgupta123 »

ray.wurlod wrote:What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?
I am very new to DataStage. I have developed some parallel jobs over the last two
months.
adasgupta123
Participant
Posts: 42
Joined: Fri Oct 20, 2006 1:58 am

Post by adasgupta123 »

balajisr wrote:How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.
Hi,

The partition count is 8, and I have checked the filesystem; the memory usage is OK.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

It's not about usage! Try to find the mount points. Also, when you say 8 nodes, are all 8 nodes actually being used, and is the data in the dataset well distributed (check the source for this)?
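
A quick way to put a number on "well distributed" is sketched below in Python. It assumes you have already obtained the record count of each partition by whatever means your installation offers. The counts shown are invented, and `partition_skew` is a hypothetical helper, not a DataStage function.

```python
def partition_skew(counts):
    """Summarise how evenly rows are spread across partitions.

    skew_ratio is the largest partition divided by the ideal (even) share:
    1.0 means perfectly balanced, larger values mean one partition does
    proportionally more of the work.
    """
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-empty partition")
    total = sum(counts)
    ideal = total / len(counts)
    largest = max(counts)
    return {
        "total": total,
        "ideal_per_partition": ideal,
        "largest": largest,
        "skew_ratio": largest / ideal,
    }


if __name__ == "__main__":
    balanced = [125] * 8          # invented counts, 8 partitions
    skewed = [650] + [50] * 7     # same total, one hot partition
    print(partition_skew(balanced))
    print(partition_skew(skewed))
```

With the skewed counts, one partition holds more than five times its even share, so the read finishes only when that partition does, no matter how quickly the other seven complete.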


adasgupta123
Participant
Posts: 42
Joined: Fri Oct 20, 2006 1:58 am

Post by adasgupta123 »

Hi,

I have checked the mount points. Data is well distributed across all
8 nodes. One thing I should mention is that the runtime column propagation option is enabled. Is that delaying the read process?

tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

RCP should not affect performance. If the data is well distributed and the file mounts are set up properly (i.e. an individual filesystem mount for each node), are you sure the problem is in reading the dataset?

The performance issue may be caused by some other processing in your job. How exactly did you conclude that the dataset read is to blame? Can you elaborate, please?
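
To illustrate why a read link can look slow when the real cost is elsewhere, here is a deliberately simplified Python model. It is only a toy, not how the parallel engine is implemented: a reader fills a bounded buffer, a downstream stage drains it, and all capacities are invented numbers. `simulate` is a hypothetical name.

```python
def simulate(reader_cap, consumer_cap, buffer_size, ticks):
    """Return the average number of rows per tick the reader managed to emit.

    reader_cap   -- rows the reader could emit per tick if never blocked
    consumer_cap -- rows the downstream stage can absorb per tick
    buffer_size  -- rows the link buffer can hold
    """
    buffered = 0
    emitted = 0
    for _ in range(ticks):
        # The downstream stage drains what it can from the buffer.
        buffered -= min(buffered, consumer_cap)
        # The reader can only emit what fits in the free buffer space.
        can_emit = min(reader_cap, buffer_size - buffered)
        buffered += can_emit
        emitted += can_emit
    return emitted / ticks


if __name__ == "__main__":
    # Fast reader, slow downstream stage: the reader is throttled.
    print(simulate(reader_cap=1000, consumer_cap=100, buffer_size=500, ticks=1000))
    # Same reader, downstream stage fast enough to keep up.
    print(simulate(reader_cap=1000, consumer_cap=2000, buffer_size=2000, ticks=1000))
```

In the first case the reader's observed rate settles at roughly the downstream rate, even though the reader itself could go ten times faster. A low figure on the dataset link therefore does not by itself show that the read is the slow part.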
adasgupta123
Participant
Posts: 42
Joined: Fri Oct 20, 2006 1:58 am

Post by adasgupta123 »

Basically, we are handling a huge amount of data every day (around 300 GB!),
and it is getting larger and larger every month.

In most of the jobs a dataset is both the first stage and the final output stage, i.e.
the output dataset of one job acts as the input to the next job.
The jobs consist mainly of join and transformation stages. In some
cases there are funnel and filter stages.

I suspect a dataset read problem because on the output links of all the other stages
the number of rows per second is much higher than on the dataset link.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Etiquette Note
It is not necessary to overquote all previous replies - they're there in the thread. Also, using Quote severely restricts your ability to earn points.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

Ray, I will take note of this from here on! Thanks.


Adasgupta,
Can you please describe your job design in detail?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Rows/sec is an almost completely meaningless metric. Various factors influence it, usually negatively, such as row width, network bottlenecks, the clock still running after all rows have been processed, and so on. I have posted before on this. There can be no such thing as an answer to the question "what is a typical rows/sec?". The main way to increase the read rate from a Data Set is to increase buffer sizes and not to have any slower stage types downstream of it. But sometimes you just have to. All else being equal, minimize the time taken by ensuring that rows are distributed equally across all partitions when the Data Set is populated.
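
To make the row-width and clock points concrete, here is a small Python illustration. All figures are invented for the example; they are not measurements from any real job, and the function names are hypothetical.

```python
def rows_per_sec(rows, elapsed_sec):
    """Rows divided by elapsed wall-clock seconds."""
    return rows / elapsed_sec


def bytes_per_sec(rows, row_width_bytes, elapsed_sec):
    """Data volume actually moved per second."""
    return rows * row_width_bytes / elapsed_sec


if __name__ == "__main__":
    # Two links moving the same volume of data in the same time.
    narrow = {"rows": 10_000_000, "width": 50, "elapsed": 100}
    wide = {"rows": 1_000_000, "width": 500, "elapsed": 100}
    for name, link in (("narrow", narrow), ("wide", wide)):
        print(
            name,
            rows_per_sec(link["rows"], link["elapsed"]),
            bytes_per_sec(link["rows"], link["width"], link["elapsed"]),
        )

    # Same wide link, but the clock keeps running until the job ends at 400 s,
    # although the last row passed at 100 s.
    print(rows_per_sec(wide["rows"], 400))
```

The two links move identical volumes per second, yet their rows/sec differ by a factor of ten, and the wide link's figure drops by a further factor of four once the idle time at the end of the job is counted.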