Performance issue while reading Unstructured stage
Moderators: chulett, rschirm, roy
I have a sequence with a Unix script that gets a list of Excel (.xlsx) file names and passes the list to a loop. The loop runs a job that reads each Excel file (which has 4 worksheets) with an Unstructured stage and writes to 4 Data Sets. Each iteration (reading 1 Excel file and writing 4 Data Sets) takes approximately 7 seconds in our Development environment. We may have to process many thousands of Excel files.
Is there a way to improve performance? Perhaps another job design, or a specific environment variable configuration?
Thank you.
Thuong
best regards
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
How long does the job take when run independently? What else is the job doing (e.g. transformations)? Are there any obvious bottlenecks in your job design? Are the Excel files on the same UNIX machine as the DataStage engine?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ray, the elementary job on its own runs even longer (25 seconds) to read 1 Excel file (4 worksheets) and write to 4 Data Sets. There is no transformation, only a simple constraint in the Transformer stage. All the Excel files are in the same folder on the DataStage server. There are approximately 20 Varchar columns per worksheet, and a few dozen records per worksheet.
When the elementary job is embedded in a loop (sequence job), it processes 1 Excel file per 7 or 8 seconds. This means approximately 500 files per hour.
Thuong
best regards
Is the system overloaded? If not, consider using a multi-instance job so that you can have more than one instance running at the same time. With two instances, for example, you should get somewhere near 1000 per hour. Ramp up the number of instances while you still have spare capacity on the server.
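The multi-instance idea above amounts to partitioning the file list so each instance processes its own subset. Here is a minimal sketch of a round-robin split; the `dsjob` invocation shown in the comment is an assumption (project, job, and parameter names are hypothetical):

```python
def split_round_robin(files, n):
    """Round-robin a list of file names into n chunks,
    one chunk per multi-instance job run."""
    chunks = [[] for _ in range(n)]
    for i, f in enumerate(files):
        chunks[i % n].append(f)
    return chunks

# Example: 20 dummy workbook names split across 4 instances.
files = [f"file{i:02d}.xlsx" for i in range(20)]
chunks = split_round_robin(files, 4)

# Each instance would then be started with its own list, e.g. (hypothetical):
#   dsjob -run -param FileList=chunk0.txt MyProject ReadExcel.inst0
for i, chunk in enumerate(chunks):
    print(i, len(chunk))  # 4 chunks of 5 files each
```

Round-robin keeps the chunks evenly sized even when the file count is not an exact multiple of the instance count.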
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thank you for your proposal.
The job that reads the Excel file writes 4 Data Sets (1 Data Set per Excel worksheet) in append mode. Running multiple instances of the job would mean one instance has to wait for the others, because they write to the same Data Sets. Is there any risk of deadlock?
Thuong
best regards
Append mode is new information.
A simple solution would be a temporary database table which multiple streams can append to. Once all jobs are done this could be flushed into the eventual target.
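The staging-table pattern can be sketched with an in-memory SQLite table standing in for the shared database table; in the real design this would be a table all job instances can reach, and the table and column names here are hypothetical:

```python
import sqlite3

# In-memory stand-in for a shared staging table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage_rows (src TEXT, col_a TEXT, col_b TEXT)")

# Two hypothetical job instances appending their rows independently;
# the database serializes the inserts, so no deadlock between appenders.
for src in ("instance0", "instance1"):
    con.executemany("INSERT INTO stage_rows VALUES (?, ?, ?)",
                    [(src, "x", "y"), (src, "p", "q")])
con.commit()

# Flush step: once all instances are done, read the staged rows back
# for loading into the eventual target.
total = con.execute("SELECT COUNT(*) FROM stage_rows").fetchone()[0]
```

The point of the intermediate table is that concurrent appends become the database's problem rather than the Data Set stage's.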
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thank you for your suggestion.
We are considering another option: converting the Excel files into CSV files with an Excel macro, then presenting the CSV files to DataStage instead of the Excel files (replacing the Unstructured stage with a Sequential File stage as the source).
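The CSV-per-worksheet output side of that conversion can be sketched as follows. The function name is hypothetical, and the dummy rows stand in for data extracted from one workbook's 4 worksheets (in practice a library such as openpyxl, or the Excel macro itself, would supply the rows):

```python
import csv
import os

def write_sheets_as_csv(sheets, out_dir):
    """Write each worksheet (name -> list of rows) to its own CSV file,
    ready for a Sequential File stage to read."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, rows in sheets.items():
        path = os.path.join(out_dir, f"{name}.csv")
        with open(path, "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
        paths.append(path)
    return paths

# Dummy data standing in for one workbook's 4 worksheets.
sheets = {f"sheet{i}": [["col_a", "col_b"], ["x", "y"]] for i in range(1, 5)}
paths = write_sheets_as_csv(sheets, "csv_out")
```

One CSV per worksheet keeps the four output streams separate, matching the four Data Sets in the original job design.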
Thuong
best regards
Our DataStage support suggested a better solution: use a wildcard (Filename*) in the source Unstructured stage. We no longer need a loop, only an elementary job reading from the Unstructured stage and writing to the Data Sets. Performance is very good compared to the former looping design (less than 2 minutes to process 500 Excel files with 4 worksheets each).
Thuong
best regards
Ah... I wondered about the loop, but assumed that if you went to the trouble of looping through the list of files it was done purposely, i.e. you had a need to process them all individually - not an unheard-of requirement. Glad you were able to leverage a filename wildcard pattern for this.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers