IBM Support for XML Files of 1 GB or More

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

IBM Support for XML Files of 1 GB or More

Post by dsdevper »

Hi

We are facing abnormal termination of the XML Input stage when using a 1 GB file; it runs smoothly for files smaller than 250 MB.
We have opened a ticket with IBM to find out the exact file size the XML stages can handle, i.e. the limit of the stages. I am kind of lost as to what else to ask. It may be a dumb thing to ask, but can anyone please let me know what to ask the IBM support people in regards to the XML stages?

They have asked for some information before calling me:

1. The version.xml from the server and client machines.
2. The dsenv, uvconfig and DSParams files from the project.
3. The job .dsx export.


Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's not much to ask. You cannot process XML files that large and people that send them out like that should be shot. IMHO. :wink:

You'll need a 'pre-processing' step to break them up into digestible pieces, or use a third-party parsing tool that doesn't need to suck the whole thing up into memory first. I would imagine there are 'stream parsers' out there somewhere.
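For illustration, a minimal sketch of that pre-processing idea using Python's stream-oriented `xml.etree.ElementTree.iterparse` (the record tag, root wrapper, and file names here are hypothetical, not from this thread — adapt them to the actual document):

```python
import xml.etree.ElementTree as ET

def split_xml(path, record_tag, chunk_size, out_pattern):
    """Stream a big XML file and write groups of <record_tag> elements
    into smaller files, never holding the full tree in memory."""
    chunk, n_files = [], 0
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == record_tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the subtree we just serialised
            if len(chunk) == chunk_size:
                _write_chunk(out_pattern % n_files, record_tag, chunk)
                n_files, chunk = n_files + 1, []
    if chunk:  # leftover records that didn't fill a chunk
        _write_chunk(out_pattern % n_files, record_tag, chunk)
        n_files += 1
    return n_files

def _write_chunk(name, record_tag, chunk):
    # Re-wrap the records in a root element so each piece is valid XML.
    with open(name, "w", encoding="utf-8") as f:
        f.write("<%s_list>%s</%s_list>" % (record_tag, "".join(chunk), record_tag))
```

Each output file is then small enough to feed to the XML Input stage on its own.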

Paging Dr Ostic, Dr Fine, Dr Ostic
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you really want to ask them something, how about asking them for their recommendation as to a 'best practice' for processing huge XML files in DataStage? I'd be curious what they say.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

Post by dsdevper »

Thanks Chulett. From reading so many posts on this forum I came to the conclusion that it cannot process a file that big, but our company wants to take this to IBM and wants to hear the answer from them.

I was expecting a reply from them like "we cannot process such a big file" as soon as they saw the ticket.

But instead they were asking for all this file information, so I thought I'd check whether there is anything else I should ask them.

I will make sure to ask them what you have said.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Late reply. Been on a long plane ride from Asia.... IBM is working on it, but we have another quarter or two to wait. We're seeing more and more large XML......

In the meantime, you would need to break it up... I've seen people do it creatively in Java, or using something like XMLMax (a Windows-based tool that can help).

Alternatively, if you have it available, you could use WebSphere TX, by itself or via MapStage from within DataStage. It uses a SAX-style reader that can handle the larger docs.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

Post by dsdevper »

Hi, I do not know what to say, but here is the reply I got from IBM support by email. Still hoping they will call me.

""Thank you for the information. Found that the XML Input stage requires, on average, 5-7 times the size of the file in memory to process the document. The memory usage is based on
the actual structure and data within the document, and on the XPATH that defined in the job.
There is a risk of random job failures with large files, even when the memory usage is optimally configured.
The recommended solution is the input XML files should be kept as small as possible. The guideline is 100 MB or less for each file.""

They said there is a risk of random job failures without giving any reason for it.

They didn't give the actual size limit of the XML stages.

Any thoughts ?


Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There are too many variables: a file that's "too large" on my system may process fine on yours, and all that matters is your limit, not mine. About all I can suggest is that you write something to generate test XML files of varying sizes and see how big you can push it until a reader job falls over dead.
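A throwaway generator along those lines might look like this in Python (the `<rows>`/`<row>` structure and padding size are made up for the sketch — shape the rows like your real feed for a meaningful test):

```python
import os

def make_test_xml(path, target_bytes):
    """Write a well-formed XML file of roughly target_bytes, so you can
    probe the size at which a reader job starts to fall over."""
    filler = b"x" * 200  # padding to bulk each row up
    with open(path, "wb") as f:
        size = f.write(b"<rows>\n")
        i = 0
        while size < target_bytes:
            size += f.write(
                b"<row><id>%010d</id><payload>%s</payload></row>\n" % (i, filler))
            i += 1
        f.write(b"</rows>\n")
    return os.path.getsize(path)
```

Generate files at, say, 100 MB, 250 MB, 500 MB and 1 GB, point the reader job at each, and you have an empirical limit for your own box instead of a guess.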

Or start looking into something to chunk them up and try to stay under a limit; their "100 MB or less" rule is a good general one. Heck, when I was building files for Google, they had that as a strict rule: as many files as we wanted to send, provided none of them was a byte over 100 MB. One bad apple and the whole bushel basket was rejected.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The problem is that the DataStage XML Input stage sees the entire file as one XML document, so it tries to validate the entire document before it starts XML processing - that's what breaks the memory limits. This is fixed in DataStage 8.5, due in a couple of months, but for now you can break the file up into smaller files, or read it through a Sequential File stage and parse the XML in a transformer.

You might be able to stage the file in a DB2 PureXML database. Have a look at this article on a benchmark for processing a terabyte of XML documents:
http://www.ibm.com/developerworks/data/ ... index.html

And this one on retrieving data from pureXML using the DataStage DB2 Connector, which supports XQueries and can shred the data without having to read all the XML at startup:
http://www.ibm.com/developerworks/data/ ... epurexml1/
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

The XML Input stage uses a DOM parser, which builds a DOM tree in memory for the XML document. Any time you run it with a large XML document, the XML Input stage crashes.

So what you can do is use a StAX parser to divide the large XML file into smaller DOM subtrees, and then evaluate each subtree with XPath individually. It's no easy task, but I got it to work.
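lstsaur's workaround is a Java (StAX) technique; as a rough illustration of the same idea — stream the document, build one small subtree at a time, evaluate an XPath on it, throw it away — here is a Python stand-in using `iterparse` (the tag names and XPath are hypothetical, and ElementTree supports only a limited XPath subset):

```python
import xml.etree.ElementTree as ET

def xpath_over_subtrees(source, record_tag, xpath):
    """Stream the document, materialise one <record_tag> subtree at a
    time, and run an ElementTree-style XPath over each subtree, so the
    whole document is never held in memory at once."""
    hits = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            hits.extend(node.text for node in elem.findall(xpath))
            elem.clear()  # discard the subtree once evaluated
    return hits
```

The peak memory is bounded by the largest single record, not the whole file — which is exactly why this survives documents that kill a DOM parser.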
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

vmcburney wrote:This is fixed in DataStage 8.5
Now, that is interesting news. Any idea on the nature, the how of the fix?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No, it was just a bullet point in the DataStage roadmap presentation at the IOD 2009 conference last October.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Post by Nagin »

lstsaur,
Can you please let me know how I can use this workaround? Which technology are the StAX parser and DOM subtrees built in?

Thanks,
Nagin.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can get version 8.5, which does handle very large XML files using a totally redesigned technique that uses streams rather than trying to store the entire XML file in memory. This new stage is only available in parallel jobs, however.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

No....fully available in Server 8.5 also!

If you are not able to prepare for 8.5, the answer is above in the thread... you will need to break up the document externally (I've heard of creative solutions using Java, and I know our own lab services offers such an option), and tools like XMLMax can do it......... or you need to read it with another tool such as WebSphere TX.......

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Post by Nagin »

Looks like we can't go to 8.5 yet. I am leaning towards splitting up the file with the help of a shell script.

But I just heard about XSLT. If I use a stylesheet, do you think DataStage will still read the entire XML file into memory?

In the job I have seen, it looks like we provide the XSLT file and the XML source file to the XML Transformer stage, and all the parsing happens there. It looks like the parsing happens on Unix itself.

I think with this approach the entire XML does not need to be loaded into memory.

Any ideas?
Post Reply