Recommendation to pick the right tool to process XML files

Formerly known as "Mercator Inside Integrator 6.7", DataStage TX enables high-volume, complex transactions without the need for additional coding.


mohan
Premium Member
Posts: 8
Joined: Tue Aug 10, 2004 1:35 am

Recommendation to pick the right tool to process XML files

Post by mohan »

Hi All,

Could DSXchange members let me know which is the right tool to process XML files where the volume is going to be huge, as huge as 1-2 million XML files per week? The loading of the XML files needs to happen on the weekend; we get the XML files in bulk, i.e. the entire 1-2 million files in a single shot.
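
(For scale, taking those numbers at face value: 2 million files over a 48-hour weekend window is a sustained rate of 2,000,000 / 172,800 s ≈ 11.6 files per second.)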

People have been recommending WTX. Also, in the DS 8.1 Designer we have WebSphere_TX_Map; I was wondering whether we could use it for the same purpose.

Your suggestions/input would be highly appreciated.

Regards,
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do you actually have WebSphere Transformation Extender? If not, the WTX map stage will be of no use at all to you.

I would seriously think about DataStage 8.5 in this case, for which there's a set of XML tools that can handle arbitrarily large XML files very efficiently.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mohan
Premium Member
Posts: 8
Joined: Tue Aug 10, 2004 1:35 am

Post by mohan »

We have a WTX 8.3 software license, and we have DataStage 8.1 software; it's very unlikely we will buy DS 8.5. Could you please advise:

Is it possible to use a mapping created in WTX 8.3 and then use that same map in DS 8.1?
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Tell us more... there isn't enough information yet to reach any kind of conclusion. Let us know things like:

a) How big are the INDIVIDUAL XML documents? The size of a SINGLE document is the key here.
b) What "exactly" are you doing with them? Reading, transforming and loading into a data warehouse? An operational system? Other XML documents? Tell us all the possibilities.
c) You noted once a week... how much time do you have in that window? One hour? Three?
d) How are you receiving them? In one big subdirectory with a million XMLs to be loaded at once? Via messaging throughout the entire weekend, trickling in at random?

There are probably more, but those questions will be a good start.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Here are your options:
- DataStage TX has a good XML stage for reading or writing XML; however, it means you will need to install both DataStage and DataStage TX and get to know both products. There is a DataStage TX plugin for DataStage jobs, so the two products can share data within a job.
- DataStage 8.1 has a poor XML Pack that does not handle high data volumes, is slow to validate schemas and struggles with complex schemas. For high volumes I would avoid it and consider using TX, or preprocessing the file to convert it to relational data, either prior to DataStage or using a DataStage Sequential File stage.
- DataStage 8.5 has a new XML Assembly which fixes most of the problems with XML processing in DataStage and is probably a better option than the previous two. It handles very high volumes and has parallel processing capabilities such as multiple XML readers. You should be able to upgrade from 8.1 to 8.5 for free (as long as you have paid-up maintenance) and the upgrade is relatively easy; version 8.5 is also easier to install than 8.1. The XML Assembly is not yet bundled into the 8.5 installer but is an added component, downloaded from IBM Fix Central and installed after 8.5 is installed.


You need to experiment to see how you can process the volume of XML files. In 8.5 you may be able to process all XML files in a directory at once, or merge the files into one large XML file.
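
If you go the merge route, a minimal pre-processing sketch (plain Python, outside DataStage; the paths and the <batch> wrapper element are hypothetical, and it assumes each input file holds a single root element with no DOCTYPE) might look like:

    import glob

    def merge_xml_files(src_glob, dest_path):
        # Stream-merge many small XML files under one synthetic root.
        with open(dest_path, "w", encoding="utf-8") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n<batch>\n')
            for path in sorted(glob.glob(src_glob)):
                with open(path, encoding="utf-8") as src:
                    for line in src:
                        # Drop each file's own XML declaration; keep the payload.
                        if line.lstrip().startswith("<?xml"):
                            continue
                        out.write(line)
            out.write("</batch>\n")

    merge_xml_files("/landing/xml/*.xml", "/landing/merged/week_batch.xml")

Because it streams line by line, memory use stays flat no matter how many files are merged.
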
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm going to pipe in here from personal experience and clarify something both Vincent and Ernie have mentioned:
Vincent wrote: DataStage 8.1 has a poor XML Pack that does not handle high data volumes
Ernie wrote: How big are the INDIVIDUAL XML documents? The size of a SINGLE document is the key here.
IMHO, Ernie's question is key - "high data volumes" can be easily handled even in the Server product if we are talking about small, easily digestible XML files. I've routinely done 0.5 million at a time, so I don't feel the additional volume you are talking about should be a problem either... again, provided that the size of a SINGLE XML document is small. Now, if "high data volumes" means huge files, then I'll agree with Vincent. I'm sure that, regardless, 8.5 would do whatever needs doing even better. And TX would handle things just fine too, I'd wager. :wink:
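
To put numbers on that for your own hardware, a rough feasibility check (plain Python, nothing DataStage-specific; the directory name is hypothetical) is to parse a sample batch once and time it:

    import glob
    import time
    import xml.etree.ElementTree as ET

    paths = glob.glob("/landing/xml/*.xml")
    start = time.time()
    for p in paths:
        ET.parse(p)  # a full parse is cheap when each individual file is small
    elapsed = time.time() - start
    if paths and elapsed > 0:
        print(f"{len(paths)} files in {elapsed:.1f}s "
              f"({len(paths) / elapsed:.0f} files/sec)")

Multiply the files/sec figure out to 1-2 million files and you'll know quickly whether the weekend window is realistic.
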
-craig

"You can never have too many knives" -- Logan Nine Fingers
mohan
Premium Member
Posts: 8
Joined: Tue Aug 10, 2004 1:35 am

Post by mohan »

Sorry folks for the late reply.

1. How big are the INDIVIDUAL XML documents? The size of a SINGLE document is the key here.

I have yet to receive the sample XML files, but they are telling me the files would sometimes be around 200 MB or more. Those kinds of huge files are less common than the smaller-sized XML files.

2. What "exactly" are you doing with them? Reading, transforming and loading into a data warehouse? An operational system? Other XML documents? Tell us all the possibilities.
Yes, we would be reading, transforming and loading into a data warehouse. The XML files are sent by machines sold by my client; we need to track which machine sent the info and what kind of info it has sent.

3. You noted once a week... how much time do you have in that window? One hour? Three?
Those XML files are received by another team and we receive them from that team once a week. As of now the SLA is not decided, but the sooner the better, since we will not have a dedicated server for the load; we will reuse servers that are not used during the weekend.

4. How are you receiving them? In one big subdirectory with a million XMLs to be loaded at once? Via messaging throughout the entire weekend, trickling in at random?
Yes, in one big directory to be loaded at once. The XML files are received daily by the other team, but they are not agreeing to provide the XML files on a real-time basis (maybe a political statement by them).

5. We have a WTX 8.3 license and a DS 8.1 license. To be frank, I have yet to learn about WTX and how to use it; I just got to know that we have the license and we need to use WTX.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Everything points to DataStage in your current release except the "200 MB and larger" XML documents. DataStage does an excellent job of "shredding" XML documents into their relational counterparts. Most often, the tabular extraction and "dynamic normalization" of the hierarchy map quite naturally onto the model of a target RDBMS. At the very least, you will have rows of data coming from various nodes of the XML document and can easily manipulate them for RDBMS tables.
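
To make "shredding" concrete, here is a minimal sketch in Python (the machine/reading element names are invented for the example; DataStage's XML stages do the equivalent declaratively):

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <machines>
      <machine id="M1">
        <reading ts="2011-01-01T00:00">42.0</reading>
        <reading ts="2011-01-01T01:00">43.5</reading>
      </machine>
    </machines>
    """)

    rows = []
    for machine in doc.iter("machine"):
        for reading in machine.iter("reading"):
            # One flat row per repeating node, with the parent key carried down.
            rows.append((machine.get("id"), reading.get("ts"), reading.text))

    print(rows)  # [('M1', '2011-01-01T00:00', '42.0'), ('M1', '2011-01-01T01:00', '43.5')]

Each repeating node becomes a row and the parent's key is denormalized onto it, which is exactly the shape an RDBMS table wants.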

Find out how real the 200 MB documents are and how often they arrive, and see if you can get one. That maximum is not a fixed value; it could be 500 MB in your case. It depends on the document, the population of the elements and attributes, how big the element names are, etc. Get several and see if/when they blow up in DS.

TX uses a more direct reading method, as does 8.5. Get up to DS 8.5 if you can, or else consider MapStage (DataStage invoking a specific TX map just for the reading part).

Also determine who is going to be supporting this. If you have 20 TX developers already and no one who knows DS, then the choice is easy (and likewise if you have 20 DS developers). Consider the long-term maintenance and available skills as you make your decision.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
tminelgin
Premium Member
Posts: 13
Joined: Tue Oct 19, 2010 12:09 pm

My two cents:

Post by tminelgin »

We are currently trying to use the 8.5 fix pack XML stage with a very large XSD. By large I mean 9,700-plus XPaths and over 65 or so objects (patient admin, lab tests, results, etc.) in a modified HL7 format. Many of these entities go down several levels and can hold multiple values. IBM says there is a 2,000-node (?) limit that causes the parsing to be 'chunked'. The data will be sent via XML messages (never a full set) and needs to end up in an EDW. There will be about 350 lookups of 'anchor' data to be processed, as well as lookups to existing entities or lookup/create for new ones. As a DataStage developer I know I can't think in batch-processing terms, but I do think that using WTX to parse and send the messages to different message queues and then leveraging the ETL power of DataStage seems like a plausible approach. We also want to maintain transactional integrity through some sort of transaction grouping, which I read about in the documentation but haven't figured out how to use yet. Are we biting off more than we can chew?
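
As a sanity check on the 'chunked' parsing idea, a minimal streaming sketch in Python (the file name and the PatientAdmin record tag are hypothetical; the 8.5 XML stage and WTX implement their own versions of this internally) shows how memory can stay bounded regardless of document size:

    import xml.etree.ElementTree as ET

    def stream_records(path, record_tag):
        # Handle one record element at a time; clear it once consumed so the
        # in-memory tree never grows, however large the overall document is.
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == record_tag:
                yield {child.tag: child.text for child in elem}
                elem.clear()

    for record in stream_records("hl7_feed.xml", "PatientAdmin"):
        pass  # hand each record to the lookup/transform/load steps

The same event-at-a-time principle is what makes very large documents tractable at run time.
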
T. M.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

From what you've told us so far, I don't think so (re: are you biting off more than you can chew)... but more detail is needed. The size of the XSD will impact your "design work", but has nothing to do with the run time. A HUGE XSD could still generate hundreds of thousands of tiny "instance" documents, each pertaining to a particular transaction type, node, sub-node or whatever.

A run-time decision needs other input, such as where the XML is coming from, how often it arrives, how large the largest "single" documents are, etc.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
tminelgin
Premium Member
Posts: 13
Joined: Tue Oct 19, 2010 12:09 pm

More info:

Post by tminelgin »

The XSD document itself has a size of 248 KB, and the XML messages will be coming from the message broker. The messages will be coming in constantly from 2,500 facilities, and I am sure there will be hundreds of thousands of 'instance' documents for each patient object being sent. I have been unable to save the entire XSD using the metadata importer and am now attempting to break it up into its components.
T. M.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Call your support provider... there is a patch beyond 8.5 FP1 to the Library Manager for handling really big XSDs. (BTW, the new 8.5 stage does not use the XML metadata importer, but instead brings XSDs into a new "library" concept.)

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)