DSXchange: DataStage and IBM Websphere Data Integration Forum
This topic has been marked "Resolved."
le thuong
Group memberships: Premium Members
Joined: 09 Sep 2009
Posts: 75
Points: 710
Posted: Wed Nov 22, 2017 5:02 am

DataStage® Release: 11x
Job Type: Parallel
OS: Unix
I have a sequence that runs a Unix script to get a list of Excel (.xlsx) file names, then passes that list to a loop. The loop invokes a job that reads each Excel file (which has 4 worksheets) with an Unstructured Data stage and writes to 4 data sets. Each iteration (reading 1 Excel file and writing 4 data sets) takes approximately 7 seconds in our Development environment, and we may have to process many thousands of Excel files.
Is there a way to improve performance? Another job design, or a specific environment variable configuration?
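For illustration, the file-listing step that feeds the loop could be sketched like this (the directory layout and function name are hypothetical, not the actual script):

```python
import glob
import os

def list_excel_files(landing_dir):
    """Return a sorted list of .xlsx paths to drive the sequence loop.

    Stands in for the Unix script feeding the StartLoop activity;
    the landing-directory layout is an assumption for illustration.
    """
    return sorted(glob.glob(os.path.join(landing_dir, "*.xlsx")))
```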
Thank you.

_________________
Thuong

best regards
chulett
Premium Poster since January 2006
Group memberships: Premium Members, Inner Circle, Server to Parallel Transition Group
Joined: 12 Nov 2002
Posts: 42386
Location: Denver, CO
Points: 217652

Posted: Wed Nov 22, 2017 6:46 am

When you say you may have "many thousands" do you have any idea what that number might actually (realistically) be? And how often will this process run?

_________________
-craig

Pinky's as perfect as the Fourth of July, quilted and timeless, seldom denied
ray.wurlod
Premium Poster
Group memberships: Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group
Joined: 23 Oct 2002
Posts: 54115
Location: Sydney, Australia
Points: 293498

Posted: Wed Nov 22, 2017 6:29 pm

How long does the job take when run independently? What else is the job doing (e.g. transformations)? Are there any obvious bottlenecks in your job design? Are the Excel files on the same UNIX mach ...

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
le thuong
Posted: Thu Nov 23, 2017 7:05 am

chulett - This is a data migration, so each Excel file will be processed only once. We may process 1,000 files or more per batch, and the total could reach 60,000 files. Today, in our Development environment, we can only process about 500 files per hour.

le thuong
Posted: Thu Nov 23, 2017 7:14 am

Ray, run independently the job actually takes longer (25 seconds) to read 1 Excel file (4 worksheets) and write the 4 data sets. There is no transformation, only a simple constraint in the Transformer stage. All the Excel files are in the same folder on the DataStage server. There are approximately 20 Varchar columns per worksheet, and a few dozen records per worksheet.
When the job is embedded in a loop in the sequence, it processes 1 Excel file every 7 or 8 seconds, i.e. approximately 500 files per hour.

ray.wurlod
Posted: Mon Nov 27, 2017 1:52 am

Is the system overloaded? If not, consider using a multi-instance job so that you can have more than one instance running at the same time. With two instances, for example, you should get somewhere ...
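One way to sketch that idea: partition the file list across N invocations of a multi-instance job, so each instance loops over its own subset of files. The helper below is illustrative, not a DataStage API:

```python
def split_for_instances(files, n_instances):
    """Round-robin the file list across n_instances, producing one
    sublist per multi-instance invocation. Every file lands in exactly
    one sublist, so no two instances touch the same Excel file."""
    return [files[i::n_instances] for i in range(n_instances)]
```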

le thuong
Posted: Mon Nov 27, 2017 6:27 am

Thank you for your proposal.
The job that reads an Excel file writes the 4 data sets (1 data set per worksheet) in append mode. Running multiple instances of the job would mean one instance has to wait for the others, because they all write to the same data sets. Is there any risk of deadlock?

ray.wurlod
Posted: Mon Nov 27, 2017 2:43 pm

Append mode is new information. A simple solution would be a temporary database table which multiple streams can append to. Once all jobs are done this could be flushed into the eventual targe ...
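A minimal sketch of that staging-table pattern, using SQLite purely as a stand-in for whatever database the migration actually targets (table and column names are invented for illustration):

```python
import sqlite3

# In-memory database as a stand-in; each job instance would INSERT
# (append) into the shared staging table as it finishes a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (src_file TEXT, col1 TEXT)")

# Simulate two parallel streams appending their rows.
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("file1.xlsx", "a"), ("file1.xlsx", "b")])
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("file2.xlsx", "c")])

# Once all jobs are done, flush the staging table into the final target.
conn.execute("CREATE TABLE target AS SELECT * FROM staging")
rows = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```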

le thuong
Posted: Tue Nov 28, 2017 2:38 am

Thank you for your suggestion.
We are considering another option: converting the Excel files to CSV with an Excel macro, then presenting the CSV files to DataStage instead of the Excel files (replacing the Unstructured Data stage with a Sequential File stage as the source).
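A server-side version of that conversion could be sketched as follows. The function expects an openpyxl-style workbook object (`.worksheets`, each with `.title` and `.iter_rows(values_only=True)`); openpyxl itself would be an added dependency, and all names here are illustrative rather than the actual migration code:

```python
import csv
import os

def workbook_to_csvs(workbook, out_dir, base_name):
    """Write one CSV per worksheet and return the paths written.

    `workbook` is any object exposing the openpyxl worksheet interface;
    with openpyxl installed it would typically come from
    load_workbook(path, read_only=True).
    """
    paths = []
    for ws in workbook.worksheets:
        path = os.path.join(out_dir, f"{base_name}_{ws.title}.csv")
        with open(path, "w", newline="") as fh:
            csv.writer(fh).writerows(ws.iter_rows(values_only=True))
        paths.append(path)
    return paths
```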

le thuong
Posted: Wed Nov 29, 2017 3:05 am

Our DataStage support suggested a better solution: use a wildcard (Filename*) in the source Unstructured Data stage. We no longer need a loop, only a single job reading through the Unstructured Data stage and writing to the data sets. Performance is very good compared to the former looping design: less than 2 minutes to process 500 Excel files with 4 worksheets each.

chulett
Posted: Wed Nov 29, 2017 7:23 am

Ah... I wondered about the loop but assumed that if you went to the trouble of looping through the list of files it was done purposely, i.e. you had a need to process them all individually - something that is not an unheard-of requirement. Glad you were able to leverage a filename wildcard pattern for this. Cool
