Web service invocation with huge volume of data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Jayakannan
Participant
Posts: 73
Joined: Wed Sep 30, 2009 5:20 am

Web service invocation with huge volume of data

Post by Jayakannan »

Hi,

My job design is DataSet -> XML Output -> Web Service Transformer -> XML Input -> File

The volume of the source dataset is around 2.8 million records for the first run (full load), with 5 fields to be sent to the web service in the request XML.

All the stages are running in parallel mode, including the XML and Web Service Transformer stages. The job took 1 hour to process 23,000 records and is still running, so it will need approximately 120 hours to process 2.8 million records.
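(The 120-hour figure is a straight extrapolation of the observed rate; a quick back-of-the-envelope check in Python:)

# Back-of-the-envelope extrapolation of the observed throughput.
rows_total = 2_800_000        # full-load volume
rows_per_hour = 23_000        # observed in the first hour
print(rows_total / rows_per_hour)   # roughly 122 hours at the current rate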

Is it possible to run this job in multi-instance mode so that each instance processes a fixed number of records in parallel? Or is there any other way to make the process faster?
Regards,
Kannan
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

A web service sending 2.8 million rows via SOAP? That's almost amusing. Why? There are so many better ways to send data in that volume. Your performance for parsing the rows may be aided by the new XML stage, but that much data via SOAP is going to be slow using almost any technology. If it has to be SOAP, encourage the authors of the service to provide a capability for batching or paging, where you tell it how many rows you want to receive and then come back for more. That is how most web services function when they have to send a lot of volume: allow the client to control how much to receive in a particular call. Salesforce.com does this beautifully (as an example).
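A minimal sketch of the batching/paging pattern described above, assuming a hypothetical endpoint URL, offset/limit paging parameters, and JSON response shape (the real service's batching contract would define all of these):

# Minimal sketch of client-side paging: ask the service for a fixed-size
# batch of records per call and keep requesting until everything is back.
# SERVICE_URL, the "offset"/"limit" parameters, and the response shape are
# hypothetical; the real service's batching contract would define them.
import requests

SERVICE_URL = "https://example.com/customer-service"  # hypothetical endpoint
PAGE_SIZE = 5000                                       # rows requested per call

def fetch_all_pages():
    offset = 0
    results = []
    while True:
        resp = requests.post(
            SERVICE_URL,
            json={"offset": offset, "limit": PAGE_SIZE},  # hypothetical paging params
            timeout=300,
        )
        resp.raise_for_status()
        page = resp.json().get("records", [])
        if not page:
            break                   # no more rows: the service is exhausted
        results.extend(page)
        offset += len(page)         # come back for the next batch
    return results

The key design point is that the client decides how much data moves per call, so a slow or failing call only costs one page rather than the whole 2.8 million rows.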

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
Jayakannan
Participant
Posts: 73
Joined: Wed Sep 30, 2009 5:20 am

Post by Jayakannan »

The service team is involved in this now.

So can we say that the bottleneck is only with the web service and the number of records being processed in a call?

Is there something that can be done in the DataStage job to increase the performance?
Regards,
Kannan
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

No, there is nothing you can do on the DataStage side. The bottleneck is caused by processing a huge volume of data through web services, which obviously aren't built for that kind of load. You need to ask your web services team to split the large data transferred by the web service into numerous smaller pieces; then, at a later stage, the web service client merges the split information. I think that approach will solve your performance problem.
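A rough sketch of that split-and-merge idea on the client side, assuming a hypothetical call_service() wrapper around one web service request; only the chunking and merging pattern is the point here:

# Rough sketch of splitting a large payload into smaller web service calls
# and merging the responses afterwards. call_service() is a hypothetical
# placeholder for one SOAP/HTTP request.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2000   # records per call (tune to what the service tolerates)

def split(records, size):
    """Yield successive fixed-size slices of the input record list."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def call_service(chunk):
    """Hypothetical: send one chunk to the web service and return its response rows.
    Replace the body with the real SOAP/HTTP request."""
    return chunk

def process(records, workers=4):
    chunks = list(split(records, CHUNK_SIZE))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(call_service, chunks))  # several smaller calls in flight
    merged = []
    for response in responses:
        merged.extend(response)       # the client merges the split results
    return merged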
Jayakannan
Participant
Posts: 73
Joined: Wed Sep 30, 2009 5:20 am

Post by Jayakannan »

Thank you.

The DataStage job started on 10/19, ran for around 48 hours, and then the status changed to Crashed with the below warning.

"Job control process (pid 62587014) has failed"

Previously, on 10/19, there were a few warnings related to the web service response: <faultcode>env:Server</faultcode><faultstring>Internal Error (from server)</faultstring>.

Even after the job status changed to Crashed, the target file in my job where the response is captured is still getting updated. I can't understand what's going on. Any views?
Regards,
Kannan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Someone or something has knocked out the conductor node - hence the status of Crashed. Did someone issue a kill -9 command?

With no conductor with which to communicate, the section leaders will take some time to react to that situation. In the interim the player processes - those doing the actual work - will continue to execute.

Eventually, all else being equal, the section leaders will issue stop requests and clean up as best they can. None of this will be logged, since all logging is performed by the conductor process.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Jayakannan
Participant
Posts: 73
Joined: Wed Sep 30, 2009 5:20 am

Post by Jayakannan »

Not to my knowledge. I was making a change in another parallel job and compiled it; at the same time, this job's status changed to Crashed.

One more question: since responses are still coming back, can we say that DataStage could have sent requests to the web service for all 2.8 million records and the web service is still processing them?

I assume that DataStage can send requests to the web service as and when the XML Output stage generates them, but it is up to the web service to pick up a certain number of records per call, process them, and send the response back to DataStage. DataStage will parse each response as soon as the web service returns one. Most of the time is spent in the web service call itself. Please correct me if my understanding is wrong.
Regards,
Kannan
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

There are a lot of things going on: the service has to package all the rows into an XML SOAP envelope, the envelope has to be put on the wire, and the SOAP client has to receive it and then parse it. There are potential bottlenecks everywhere.

As noted above, using the new XML stage instead of the XML Input stage after the web service call may help, but there are no guarantees from what we know at this time.

In general, this is why many apps use a web service to invoke secure FTP (or other mechanisms) for large transfers, or have started using JSON as the transport format, because it is (typically, not always) smaller in payload than XML.
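As a rough illustration of that payload difference, here is the same five-field record serialized as XML and as JSON (the field names are invented; real savings depend on how verbose the element names and SOAP envelope are):

# Quick comparison of the on-the-wire size of one record serialized as
# XML versus JSON. Field names are made up for illustration.
import json
import xml.etree.ElementTree as ET

record = {"id": "1001", "name": "ACME", "city": "Chennai", "status": "A", "amount": "129.50"}

# XML: one element per field, as a request body fragment might look
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_bytes = ET.tostring(root)

# JSON: the same record as a compact object
json_bytes = json.dumps(record, separators=(",", ":")).encode()

print(len(xml_bytes), "bytes as XML")    # every field name appears twice (open/close tags)
print(len(json_bytes), "bytes as JSON")  # every field name appears once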

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)