IBM Support for XML Files of 1 GB or More

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

IBM Support for XML Files of 1 GB or More

Post by dsdevper »

Hi

We are facing abnormal termination of the XML Input stage when using a 1 GB file; it runs smoothly for files smaller than 250 MB.
We have opened a ticket with IBM to find out the exact file size the XML stages can handle, i.e. the limit of the stages. I am kind of lost as to what else to ask. It may be a dumb thing to ask, but can anyone please let me know what to ask the IBM support people in regards to the XML stages?

They have asked for some information before calling me:

1. The version.xml from the server and client machines.
2. The dsenv, uvconfig and DSParams files from the project.
3. The job .dsx export.


Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's not much to ask. You cannot process XML files that large and people that send them out like that should be shot. IMHO. :wink:

You'll need a 'pre-processing' step to break them up into digestible pieces, or use a third-party parsing tool that doesn't need to suck the whole thing up into memory first. I would imagine there are 'stream parsers' out there somewhere.
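For illustration, a minimal sketch of that pre-processing idea using Python's stream-oriented `xml.etree.ElementTree.iterparse` (the record tag, root wrapper, and file names here are hypothetical, not from this thread — adapt them to the actual document):

```python
import xml.etree.ElementTree as ET

def split_xml(path, record_tag, chunk_size, out_pattern):
    """Stream a big XML file and write groups of <record_tag> elements
    into smaller files, never holding the full tree in memory."""
    chunk, n_files = [], 0
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == record_tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the subtree we just serialised
            if len(chunk) == chunk_size:
                _write_chunk(out_pattern % n_files, record_tag, chunk)
                n_files, chunk = n_files + 1, []
    if chunk:  # leftover records that didn't fill a chunk
        _write_chunk(out_pattern % n_files, record_tag, chunk)
        n_files += 1
    return n_files

def _write_chunk(name, record_tag, chunk):
    # Re-wrap the records in a root element so each piece is valid XML.
    with open(name, "w", encoding="utf-8") as f:
        f.write("<%s_list>%s</%s_list>" % (record_tag, "".join(chunk), record_tag))
```

Each output file is then small enough to feed to the XML Input stage on its own.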

Paging Dr Ostic, Dr Fine, Dr Ostic
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you really want to ask them something, how about asking them for their recommendation as to a 'best practice' for processing huge XML files in DataStage? I'd be curious what they say.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

Post by dsdevper »

Thanks Chulett. From reading so many posts on this forum I came to the conclusion that it cannot process a file that big, but our company wants to take this to IBM and wants to hear the answer from them.

I was expecting a reply from them like "we cannot process such a big file" as soon as they saw the ticket.

But instead they were asking for all this file information, so I thought I'd check whether there is anything else I should ask them.

I will make sure to ask them what you have said.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Late reply. Been on a long plane ride from Asia.... IBM is working on it, but we have another quarter or two to wait. We're seeing more and more large XML......

In the meantime, you would need to break it up... I've seen people do it creatively in Java, or using something like XMLMax (a Windows-based tool that can help).

Alternatively, if you have it available, you could use WebSphere TX, by itself or via MapStage from within DataStage. It uses a SAX-style reader that can handle the larger docs.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
dsdevper
Premium Member
Posts: 86
Joined: Tue Aug 19, 2008 9:31 am

Post by dsdevper »

Hi, I do not know what to say, but here is the reply I got from IBM support by email. Still hoping they will call me.

""Thank you for the information. Found that the XML Input stage requires, on average, 5-7 times the size of the file in memory to process the document. The memory usage is based on
the actual structure and data within the document, and on the XPATH that defined in the job.
There is a risk of random job failures with large files, even when the memory usage is optimally configured.
The recommended solution is the input XML files should be kept as small as possible. The guideline is 100 MB or less for each file.""

They said there is a risk of random job failures without giving any reason for it.

They didn't give the actual size limit of the XML stages.

Any thoughts ?


Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There are too many variables: a file that's "too large" on my system may process fine on yours, and all that matters is your limit, not mine. About all I can suggest is that you write something to generate test XML files of varying sizes and see how big you can push it until a reader job falls over dead.
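A throwaway generator along those lines might look like this in Python (the `<rows>`/`<row>` structure and padding size are made up for the sketch — shape the rows like your real feed for a meaningful test):

```python
import os

def make_test_xml(path, target_bytes):
    """Write a well-formed XML file of roughly target_bytes, so you can
    probe the size at which a reader job starts to fall over."""
    filler = b"x" * 200  # padding to bulk each row up
    with open(path, "wb") as f:
        size = f.write(b"<rows>\n")
        i = 0
        while size < target_bytes:
            size += f.write(
                b"<row><id>%010d</id><payload>%s</payload></row>\n" % (i, filler))
            i += 1
        f.write(b"</rows>\n")
    return os.path.getsize(path)
```

Generate files at, say, 100 MB, 250 MB, 500 MB and 1 GB, point the reader job at each, and you have an empirical limit for your own box instead of a guess.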

Or start looking into something to chunk them up and try to stay under a limit; their "100 MB or less" rule is a good general one. Heck, when I was building files for Google, they had that as a strict rule: as many files as we wanted to send, provided none of them was a byte over 100 MB. One bad apple and the whole bushel basket was rejected.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The problem is that the DataStage XML Input stage sees the entire file as one XML document, so it tries to validate the entire document before it starts XML processing - that's what breaks the memory limits. This is fixed in DataStage 8.5, due in a couple of months, but for now you can break the file up into smaller files, or read it through a Sequential File stage and parse the XML in a transformer.

You might be able to stage the file in a DB2 PureXML database. Have a look at this article on a benchmark for processing a terabyte of XML documents:
http://www.ibm.com/developerworks/data/ ... index.html

And this one on retrieving data from pureXML using the DataStage DB2 Connector, which supports XQueries and can shred the data without having to read all the XML at startup:
http://www.ibm.com/developerworks/data/ ... epurexml1/
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

The XML Input stage uses a DOM parser, which builds a DOM tree in memory for the XML document. Any time you run it with a large XML document, the XML Input stage crashes.

So what you can do is use a StAX parser to divide the large XML file into smaller DOM subtrees, and then evaluate each subtree with XPath individually. It's no easy task, but I got it to work.
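lstsaur's workaround is a Java (StAX) technique; as a rough illustration of the same idea — stream the document, build one small subtree at a time, evaluate an XPath on it, throw it away — here is a Python stand-in using `iterparse` (the tag names and XPath are hypothetical, and ElementTree supports only a limited XPath subset):

```python
import xml.etree.ElementTree as ET

def xpath_over_subtrees(source, record_tag, xpath):
    """Stream the document, materialise one <record_tag> subtree at a
    time, and run an ElementTree-style XPath over each subtree, so the
    whole document is never held in memory at once."""
    hits = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            hits.extend(node.text for node in elem.findall(xpath))
            elem.clear()  # discard the subtree once evaluated
    return hits
```

The peak memory is bounded by the largest single record, not the whole file — which is exactly why this survives documents that kill a DOM parser.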
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

vmcburney wrote:This is fixed in DataStage 8.5
Now, that is interesting news. Any idea on the nature, the how of the fix?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No, it was just a bullet point in the DataStage roadmap presentation at the IOD 2009 conference last October.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Post by Nagin »

lstsaur,
Can you please let me know how I can use this workaround? Which technology are the StAX parser and DOM subtrees built in?

Thanks,
Nagin.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can get version 8.5, which does handle very large XML files using a totally redesigned technique that uses streams rather than trying to store the entire XML file in memory. This new stage is only available in parallel jobs, however.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

No....fully available in Server 8.5 also!

If you are not able to prepare for 8.5, the answer is above in the thread... you will need to break up the document externally (I've heard of creative solutions using Java, and I know our own lab services offers such an option), and tools like XMLMax can do it......... or you need to read it with another tool such as WebSphere TX.......

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Post by Nagin »

Looks like we can't go to 8.5 yet. I am leaning towards splitting up the file with the help of a shell script.

But I just heard about XSLT. If I use a stylesheet, do you think DataStage will still read the entire XML file into memory?

In the job I have seen, it looks like we provide the XSLT file and the XML source file to the XML Transformer stage, and all the parsing happens there. It looks like the parsing happens on Unix itself.

I think with this approach the entire XML does not need to be loaded into memory.

Any ideas?
Post Reply