Performance issue while reading Unstructured stage

le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

I have a sequence with a Unix script that gets a list of file names (xlsx) and passes this list to a loop. The job sequence reads each Excel file (each having 4 worksheets) with an Unstructured stage and writes to 4 data sets. Each iteration (reading 1 Excel file and writing 4 data sets) takes approximately 7 seconds in our Development environment. We may have to process many thousands of Excel files.
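For context, a minimal version of such a file-listing script might look like this (the paths here are illustrative, not our real ones):

    #!/bin/sh
    # Build the list of Excel files for the sequence loop to iterate over.
    SRC_DIR=/data/migration/incoming      # illustrative source directory
    ls "$SRC_DIR"/*.xlsx > /tmp/xlsx_list.txt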
Is there a way to improve performance? A different job design, or a specific environment variable configuration?
Thank you.
Thuong

best regards
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

When you say you may have "many thousands", do you have any idea what that number might actually (realistically) be? And how often will this process run?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

How long does the job take when run independently? What else is the job doing (e.g. transformations)? Are there any obvious bottlenecks in your job design? Are the Excel files on the same UNIX machine as the DataStage engine?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

chulett - This is a data migration, so each Excel file would be processed only once. We may process 1,000 files or more per batch, and the total could reach 60,000 files. Today, in our Development environment, we can only process about 500 files per hour.
Thuong

best regards
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

Ray, the elementary job takes even longer (25 seconds) when run on its own to read 1 Excel file (4 worksheets) and write to 4 data sets. There is no transformation, only a simple constraint in the Transformer stage. All the Excel files are in the same folder on the DataStage server. There are approximately 20 Varchar columns per worksheet and a few dozen records per worksheet.
When the elementary job is embedded in a loop (sequence job), it processes 1 Excel file every 7 or 8 seconds, which is approximately 500 files per hour.
Thuong

best regards
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is the system overloaded? If not, consider using a multi-instance job so that you can have more than one instance running at the same time. With two instances, for example, you should get somewhere near 1000 per hour. Ramp up the number of instances while you still have spare capacity on the server.
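As a rough sketch (the project, job, and parameter names below are invented for illustration), two instances of a multi-instance job can be launched from the shell with dsjob, each fed its own slice of the file list:

    # Start two instances of a multi-instance job in parallel.
    # MYPROJ, ReadExcel and pFileList are illustrative names only.
    dsjob -run -wait -param pFileList=/tmp/xlsx_list.part1 MYPROJ ReadExcel.inst1 &
    dsjob -run -wait -param pFileList=/tmp/xlsx_list.part2 MYPROJ ReadExcel.inst2 &
    wait   # block until both instances have finished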
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

Thank you for your proposal.
The job that reads the Excel file writes 4 data sets (1 data set per Excel worksheet) in append mode. Running multiple instances of the job would mean that one instance has to wait for the others, because they are all writing to the same data sets. Is there any risk of deadlock?
Thuong

best regards
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Append mode is new information.

A simple solution would be a temporary database table to which multiple streams can append. Once all jobs are done, this could be flushed into the eventual target.
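For example, the flush could be a single insert-select run once all instances have finished. A minimal sketch, assuming a DB2 target purely for illustration (database and table names are invented):

    # Flush the shared staging table into the eventual target.
    db2 connect to MIGDB
    db2 "INSERT INTO TARGET_TBL SELECT * FROM STAGING_TBL"
    db2 "DELETE FROM STAGING_TBL"
    db2 connect reset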
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

Thank you for your suggestion.
We are considering another option: converting the Excel files to CSV files with an Excel macro, then presenting the CSV files to DataStage instead of the Excel files (replacing the Unstructured stage with a Sequential File stage as the source).
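If the macro route proves awkward, one possible server-side alternative (a sketch only, not something we have tested) is LibreOffice in headless mode. Note that by default it exports only the first worksheet of each workbook, so the 4 worksheets would still need to be split out by a macro or a separate step:

    # Convert each workbook to CSV on the server (paths are illustrative).
    # Caveat: --convert-to csv exports only the first sheet of a workbook.
    for f in /data/migration/incoming/*.xlsx; do
        soffice --headless --convert-to csv --outdir /data/migration/csv "$f"
    done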
Thuong

best regards
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Post by le thuong »

Our DataStage support team suggested a better solution: use a wildcard (Filename*) in the source Unstructured stage. We no longer need a loop, only an elementary job that reads through the Unstructured stage and writes to the data sets. The performance is very good compared to the former design with a loop: less than 2 minutes to process 500 Excel files with 4 worksheets each.
Thuong

best regards
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... I wondered about the loop but assumed that if you went to the trouble of looping through the list of files it was done purposely, i.e. you had a need to process each one individually, which is not an unheard-of requirement. Glad you were able to leverage a filename wildcard pattern for this. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers