XML stage is very slow?

DSUser2000
Participant
Posts: 42
Joined: Tue Oct 20, 2009 8:36 am

Post by DSUser2000 »

We are using DataStage 8.7 and are trying to use the new XML stage in place of a server job with the XML Input stage. As a test, we parsed a few thousand of the XML job logs that DataStage can produce (DSJobReport). In both cases the job only imports the XML and peeks the data, so any difference should come down to this stage. The parallel stage uses all 4 AIX CPUs and is still about 30 s slower than the server stage, which uses less than one CPU. It also seems to process only about 3 of these XML files per CPU in parallel mode, which is very low (they are just a few KB in size!). Is that normal? Are there any tuning parameters?
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Your mileage may vary. There are a whole lot of things to consider, and if the XML is tiny, it is likely that EE Job overhead, not the XML parsing, is taking up the time.

Not sure how much XML data you are testing with. The XML stage's biggest advantage shows when you have many large documents, say 20, 30, 50, 100, or 200 megabytes in size, and hundreds of them. That's when its smarter ability to expand only the nodes it needs, and to read documents bit by bit instead of loading the whole thing into memory, will start to shine.
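To make the "bit by bit" idea concrete, here is a minimal sketch in plain Python (not DataStage, and not how the XML stage is implemented internally). It streams through a document with the standard library's `iterparse` and keeps only one element type, releasing everything else as it goes. The sample document and the `msg` tag are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_selected(source, wanted_tag):
    """Yield the text of every wanted_tag element while streaming the document."""
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted_tag:
            yield elem.text
        # Drop the element's text, attributes, and children so they can be freed.
        # The emptied element itself stays attached to its parent.
        elem.clear()


# Hypothetical sample document; a real one would be a large file on disk.
sample = (
    b"<log>"
    b"<entry><id>1</id><msg>started</msg><detail>x</detail></entry>"
    b"<entry><id>2</id><msg>finished</msg><detail>y</detail></entry>"
    b"</log>"
)

messages = list(stream_selected(io.BytesIO(sample), "msg"))
print(messages)
```

With a large file you would pass the file path instead of the `BytesIO` object. The payoff of this style only shows up when the document is big and most of its nodes are not needed.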

In recent testing I did for another site, I had a gig worth of XML documents (about 40 of them) in a subdirectory, and the new stage, for that Job and that structure, was about 30 times faster.

In the small-document area, ESPECIALLY when there are few of them (less than 5 MB in size or in total), the simple overhead of loading EE and loading the Stage made the Server Job far faster. The XML might still have parsed quicker with the XML Stage, but the set-up took 5 times longer.
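A rough way to picture this trade-off is a fixed start-up cost plus a per-megabyte parse cost. The sketch below is only a toy model in Python, and every number in it is invented for illustration (the 5x set-up ratio echoes the figure above, the throughput values are pure assumptions, not measurements of either stage).

```python
def total_seconds(setup_s, data_mb, mb_per_s):
    """Fixed start-up time plus time to parse data_mb at a steady rate."""
    return setup_s + data_mb / mb_per_s


def break_even_mb(setup_a, rate_a, setup_b, rate_b):
    """Data volume at which job A and job B take the same total time."""
    return (setup_b - setup_a) / (1 / rate_a - 1 / rate_b)


# Hypothetical figures: server job starts fast but parses slowly,
# parallel job takes 5x longer to start but parses faster.
SERVER = {"setup_s": 2.0, "mb_per_s": 1.0}
PARALLEL = {"setup_s": 10.0, "mb_per_s": 5.0}

break_even = break_even_mb(
    SERVER["setup_s"], SERVER["mb_per_s"],
    PARALLEL["setup_s"], PARALLEL["mb_per_s"],
)

for size_mb in (1, 100):
    server_time = total_seconds(SERVER["setup_s"], size_mb, SERVER["mb_per_s"])
    parallel_time = total_seconds(PARALLEL["setup_s"], size_mb, PARALLEL["mb_per_s"])
    print(size_mb, "MB:", "server", server_time, "s,", "parallel", parallel_time, "s")

print("break-even at roughly", break_even, "MB")
```

Below the break-even volume the start-up cost dominates and the job with the cheaper set-up wins; above it, the faster parser wins. Timing your own jobs on a tiny input and on a large one would give you real values to plug in.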

In summary, the XML stage will be much faster when:

a) you have lots of XML content per document.
b) you have lots of XML documents.
c) you have a complex node path (thousands of elements and attributes) and you are only selecting a small subset of them.

I applaud your efforts to use something like a log file from DS as a test, but try to find an XML document with 10 or 12 hierarchical paths, retrieve from one of those, and you will see significant differences.

Ernie
Ernie Ostic
