
Performance issue reading Excel files with the Unstructured stage

Posted: Wed Nov 22, 2017 5:02 am
by le thuong
I have a sequence with a Unix script that gets a list of (xlsx) file names and passes this list to a loop. The job sequence reads each Excel file (which has 4 worksheets) with an Unstructured stage and writes to 4 data sets. Each iteration (reading 1 Excel file and writing 4 data sets) takes approximately 7 seconds in our Development environment. We may have to process many thousands of Excel files.
Is there a way to improve performance? Another job design, or a specific environment variable configuration?
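For context, the list-building step amounts to a shell glob over the landing directory. A minimal sketch, where the directory and list file names are hypothetical:

    # Build the list of Excel files that the sequence loop will iterate over
    # (/data/migration/incoming and file_list.txt are hypothetical names).
    ls /data/migration/incoming/*.xlsx > /tmp/file_list.txt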
Thank you.

Posted: Wed Nov 22, 2017 6:46 am
by chulett
When you say you may have "many thousands" do you have any idea what that number might actually (realistically) be? And how often will this process run?

Posted: Wed Nov 22, 2017 6:29 pm
by ray.wurlod
How long does the job take when run independently? What else is the job doing (e.g. transformations)? Are there any obvious bottlenecks in your job design? Are the Excel files on the same UNIX machine as the DataStage engine?

Posted: Thu Nov 23, 2017 7:05 am
by le thuong
chulett - This is a data migration, so each Excel file would be processed only once. We may process 1,000 files or more per batch, and the total could reach 60,000 files. Today, in our Development environment, we can only process about 500 files per hour.

Posted: Thu Nov 23, 2017 7:14 am
by le thuong
Ray, the elementary job takes even longer (25 sec) when run independently to read 1 Excel file (4 worksheets) and write to 4 data sets. There is no transformation, only a simple constraint in the Transformer stage. All the Excel files are in the same folder on the DataStage server. There are approximately 20 Varchar columns per worksheet, and a few dozen records per worksheet.
When the elementary job is embedded in a loop (sequence job), it processes 1 Excel file every 7 or 8 seconds, which means approximately 500 files per hour.

Posted: Mon Nov 27, 2017 1:52 am
by ray.wurlod
Is the system overloaded? If not, consider using a multi-instance job so that you can have more than one instance running at the same time. With two instances, for example, you should get somewhere near 1000 per hour. Ramp up the number of instances while you still have spare capacity on the server.
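For illustration, two invocations of a multi-instance job can be launched in parallel from the dsjob command line. A minimal sketch, where the project, job, and parameter names are hypothetical and the job is assumed to be compiled as multi-instance:

    # Run two invocations of the same job side by side, each on its own
    # half of the file list (MYPROJECT, LoadExcel, FileList are hypothetical).
    dsjob -run -jobstatus -param FileList=list_part1.txt MYPROJECT LoadExcel.INV1 &
    dsjob -run -jobstatus -param FileList=list_part2.txt MYPROJECT LoadExcel.INV2 &
    wait    # -jobstatus makes each dsjob call block until its job finishes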

Posted: Mon Nov 27, 2017 6:27 am
by le thuong
Thank you for your proposal.
The job that reads the Excel file writes 4 data sets (one data set for each Excel worksheet) in append mode. Running multiple instances of the job would mean that one instance has to wait for the other, because they are writing to the same data sets. Is there any risk of a 'deadlock'?

Posted: Mon Nov 27, 2017 2:43 pm
by ray.wurlod
Append mode is new information.

A simple solution would be a temporary database table to which multiple streams can append. Once all jobs are done, it could be flushed into the eventual target.
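For illustration, once all parallel instances have finished, the flush could be a single INSERT ... SELECT. A generic sketch, assuming a hypothetical staging table STG_WORKSHEETS, target table TGT_WORKSHEETS, and whatever SQL command-line client the database provides:

    # Flush the shared staging table into the eventual target once all
    # parallel job instances are done (all names here are hypothetical).
    your_sql_client <<'SQL'
    INSERT INTO TGT_WORKSHEETS SELECT * FROM STG_WORKSHEETS;
    TRUNCATE TABLE STG_WORKSHEETS;
    SQL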

Posted: Tue Nov 28, 2017 2:38 am
by le thuong
Thank you for your suggestion.
We are considering another option: converting the Excel files into CSV files with an Excel macro, then presenting the CSV files to DataStage instead of the Excel files (replacing the Unstructured stage with a Sequential File stage as the source).
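If a server-side tool is acceptable instead of an Excel macro, the conversion can also be scripted. A sketch assuming Gnumeric's ssconvert is available on the server (paths are hypothetical):

    # Convert every workbook into one CSV per worksheet; -S
    # (--export-file-per-sheet) substitutes the sheet number for %n.
    for f in /data/migration/incoming/*.xlsx; do
        ssconvert -S "$f" "${f%.xlsx}.%n.csv"
    done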

Posted: Wed Nov 29, 2017 3:05 am
by le thuong
Our DataStage support suggested a better solution: use a wildcard (Filename*) in the source Unstructured stage. We no longer need a loop, only an elementary job reading through an Unstructured stage and writing to the data sets. The performance is very good compared to the former design with a loop (less than 2 minutes to process 500 Excel files with 4 worksheets each).
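As a quick sanity check before running the job, the same pattern can be tested from the shell to confirm how many files it will pick up (path and pattern are hypothetical):

    # Count the files the Unstructured stage's wildcard will match.
    ls /data/migration/incoming/Filename*.xlsx | wc -l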

Posted: Wed Nov 29, 2017 7:23 am
by chulett
Ah... I wondered about the loop but assumed that if you went to the trouble of looping through the list of files it was done on purpose, i.e. you had a need to process them all individually - something that is not an unheard-of requirement. Glad you were able to leverage a filename wildcard pattern for this. 8)