DSXchange: DataStage and IBM Websphere Data Integration Forum
mohan



Group memberships:
Premium Members

Joined: 10 Aug 2004
Posts: 8

Points: 97

Posted: Fri Mar 11, 2011 6:15 pm

DataStage® Release: 8x
Job Type: TX
OS: Unix
Hi All,

Could DSXchange members let me know which is the right tool for processing XML files when the volume is huge, as many as 1-2 million XML files per week? The XML files have to be loaded over the weekend, and we receive them in bulk, i.e. the entire 1-2 million files in a single shot.

People have been recommending WTX. Also, the DS 8.1 Designer has a WebSphere_TX_Map stage, and I was wondering whether we could use it for the same purpose.

Any suggestions or input would be highly appreciated.

Regards,
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54254
Location: Sydney, Australia
Points: 294257

Posted: Fri Mar 11, 2011 8:36 pm

Do you actually have WebSphere Transformation Extender? If not, the WTX map stage will be of no use at all to you. I would seriously think about DataStage 8.5 in this case, for which there's a ...

_________________
RXP Services Ltd
Melbourne | Canberra | Sydney | Hong Kong | Hobart | Brisbane
currently hiring: Canberra, Sydney and Melbourne
mohan




Posted: Mon Mar 14, 2011 12:39 am

We have a WTX 8.3 software license and DataStage 8.1, and we are very unlikely to buy DS 8.5. Could you please advise:

Is it possible to create a map in WTX 8.3 and then use the same map in DS 8.1?
eostic

Premium Poster



Group memberships:
Premium Members

Joined: 17 Oct 2005
Posts: 3773

Points: 30298

Posted: Mon Mar 14, 2011 5:38 am

Tell us more... There isn't enough information to draw any kind of conclusion. Let us know things like:

a) How big are the INDIVIDUAL xml documents? The size of a SINGLE document is the key here.
b) What "exactly" are you doing with them? Reading, transforming and loading into a data warehouse? An operational system? Other xml documents? Tell us all the possibilities.
c) You noted once a week... how much time do you have in that window? One hour? Three?
d) How are you receiving them? In one big subdirectory with a million xml documents to be loaded at once? Via messaging throughout the entire weekend, trickling in at random?

There are probably more, but those questions will be a good start.

Ernie

_________________
Ernie Ostic

blogit!
Open IGC is Here!
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3582
Location: Australia, Melbourne
Points: 27998

Posted: Mon Mar 14, 2011 5:12 pm

Here are your options:
- DataStage TX has a good XML stage for reading or writing XML. However, it means you will need to install both DataStage and DataStage TX and get to know both products. There is a DataStage TX plugin for DataStage jobs so the two products can share data within a job.
- DataStage 8.1 has a poor XML Pack that does not handle high data volumes, is slow to validate schemas and struggles with complex schemas. For high volumes I would avoid it and consider using TX, or preprocessing the file to convert it to relational data, either prior to DataStage or using a DataStage Sequential File stage.
- DataStage 8.5 has a new XML Assembly which fixes most of the problems with XML processing in DataStage and is probably a better option than the previous two. It handles very high volumes and has parallel processing capabilities such as multiple XML readers. You should be able to upgrade from 8.1 to 8.5 for free (as long as you have paid-up maintenance), and the upgrade is relatively easy. Version 8.5 is easier to install than 8.1. The XML Assembly is not yet bundled into the 8.5 installer; it is an add-on component, downloaded from IBM Fix Central and installed after 8.5 itself.


You need to experiment to see how you can process the volume of XML files. In 8.5 you may be able to process all XML files in a directory at once, or merge the files into one large XML file.
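
As a rough illustration of the "merge the files into one large XML file" idea, here is a minimal Python sketch that runs outside DataStage as a preprocessing step. It is not a DataStage or TX feature; the function name `merge_xml_files` and the wrapper element `batch` are made up for the example, and it assumes each input document is small enough to parse in memory.

```python
import xml.etree.ElementTree as ET


def merge_xml_files(paths, out_path, wrapper_tag="batch"):
    """Write the root element of every input file under one wrapper element."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"<{wrapper_tag}>\n")
        for path in paths:
            # Parse one small document at a time; only its root is kept.
            root = ET.parse(path).getroot()
            # encoding="unicode" returns a str without an XML declaration,
            # so the fragments can be concatenated safely.
            out.write(ET.tostring(root, encoding="unicode"))
            out.write("\n")
        out.write(f"</{wrapper_tag}>\n")
    return out_path
```

The merged file is written incrementally, so memory use depends on the largest single input rather than on the size of the whole batch. Whether one large file or many small ones performs better in a given job is something to measure, as noted above.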

_________________
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney LinkedIn
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 42621
Location: Denver, CO
Points: 219439

Posted: Mon Mar 14, 2011 5:56 pm

I'm going to pipe in here from personal experience and clarify something both Vincent and Ernie have mentioned:

Vincent wrote:
DataStage 8.1 has a poor XML Pack that does not handle high data volumes

Ernie wrote:
how big are the INDIVIDUAL xml documents. The size of a SINGLE document is the key here.

IMHO, Ernie's question is key: "high data volumes" can be easily handled even in the Server product if we are talking about small, easily digestible XML files. I've routinely done half a million at a time, so I don't feel the additional volume you are talking about should be a problem either... again, provided that the size of a SINGLE xml document is small. Now, if "high data volumes" means huge files, then I'll agree with Vincent. I'm sure that, regardless, 8.5 would do whatever needs doing more better. And TX would handle things just fine too, I'd wager. ;)

_________________
-craig

And I'm hovering like a fly, waiting for the windshield on the freeway...
mohan




Posted: Wed Mar 16, 2011 1:36 am

Sorry for the late reply, folks.

1. How big are the INDIVIDUAL xml documents? The size of a SINGLE document is the key here.

I have yet to receive the sample XML files, but I am told they will sometimes be around 200 MB or more. Files that large are less common than the smaller ones.

2. What "exactly" are you doing with them? Reading, transforming and loading into a data warehouse? An operational system? Other xml documents? Tell us all the possibilities.
Yes, we will be reading, transforming and loading into a data warehouse. The XML files are sent by machines sold by my client; we need to track which machine sent the information and what kind of information it sent.


3. You noted once a week... how much time do you have in that window? One hour? Three?
Yes. The XML files are received by another team, and we receive them from that team once a week. No SLA has been decided yet, but the sooner the better, since we will not have a dedicated server for the load and will reuse servers that are not in use during the weekend.

4. How are you receiving them? In one big subdirectory with a million xml documents to be loaded at once? Via messaging throughout the entire weekend, trickling in at random?
Yes, in one big directory, to be loaded at once. The other team receives the XML files daily, but they will not agree to pass them on in real time (which may be a political stance on their part).

5. We have a WTX 8.3 license and a DS 8.1 license. To be frank, I have yet to learn WTX and how to use it. I have only just found out that we have the license and that we need to use WTX.
eostic


Posted: Wed Mar 16, 2011 4:47 am

Everything points to DataStage in your current release except the "200meg and larger" xml documents. DataStage does an excellent job of "shredding" xml documents into their relational counterparts. Most often, the tabular extraction and "dynamic normalization" of the hierarchy works quite naturally for the model of a target rdbms. At the very least, you will have rows of data coming from various nodes of the xml document and can easily manipulate them for rdbms tables.
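
To make "shredding" concrete for readers new to the term, here is a small Python sketch of the general idea, not of how the DataStage XML stages implement it. The document layout (`machines`, `machine`, `reading`) and the function name `shred` are invented for illustration; the real files in this thread have not been seen. Each repeating child becomes its own row and carries the key of its parent, which is the normalization a relational target expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are assumptions.
SAMPLE = """<machines>
  <machine id="M1" model="X100">
    <reading type="temp" value="71"/>
    <reading type="rpm" value="1200"/>
  </machine>
  <machine id="M2" model="X200">
    <reading type="temp" value="65"/>
  </machine>
</machines>"""


def shred(xml_text):
    """Return (machine_rows, reading_rows); each reading row carries its parent key."""
    root = ET.fromstring(xml_text)
    machine_rows, reading_rows = [], []
    for machine in root.iter("machine"):
        machine_id = machine.get("id")
        machine_rows.append((machine_id, machine.get("model")))
        # Repeating children are flattened into one row each.
        for reading in machine.iter("reading"):
            reading_rows.append(
                (machine_id, reading.get("type"), reading.get("value"))
            )
    return machine_rows, reading_rows
```

Calling `shred(SAMPLE)` yields two lists of tuples, one per target table, with the machine id acting as the foreign key in the reading rows.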

Find out how real the 200 MB documents are and how often they arrive, and see if you can get one. The maximum is not a fixed value. It could be 500 MB in your case; it depends on the document, the population of the elements and attributes, how big those element names are, and so on. Get several and see if or when they blow up in DS.
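
One way to find out how often the large documents really occur is to profile the landing directory before designing the job. The following is a minimal Python sketch under assumptions: the function name `profile_xml_sizes` is hypothetical, the files are assumed to carry an `.xml` extension, and the 200 MB threshold is simply the figure mentioned in this thread.

```python
import os


def profile_xml_sizes(directory, large_bytes=200 * 1024 * 1024):
    """Count the .xml files in a directory and list those at or above large_bytes."""
    count, total, largest = 0, 0, 0
    large_files = []
    # os.scandir streams directory entries, which matters with millions of files.
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file() or not entry.name.lower().endswith(".xml"):
                continue
            size = entry.stat().st_size
            count += 1
            total += size
            largest = max(largest, size)
            if size >= large_bytes:
                large_files.append((entry.name, size))
    return {
        "count": count,
        "total_bytes": total,
        "largest_bytes": largest,
        "large_files": sorted(large_files),
    }
```

The names in `large_files` are the candidates to test first, since those are the documents most likely to hit a memory limit.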

TX uses a more direct reading method, as does 8.5. Get up to 8.5 DS if you can, or else consider MapStage (DataStage invoking a specific TX Map just for the reading part).

Also determine who is going to support this. If you already have 20 TX developers and no one who knows DS, then the choice is easy (and likewise if you have 20 DS developers). Consider the long-term maintenance and available skills as you make your decision.

Ernie

tminelgin



Group memberships:
Premium Members

Joined: 19 Oct 2010
Posts: 13

Points: 111

Posted: Sat Aug 20, 2011 6:20 pm

We are currently trying to use the 8.5 fix pack XML stage with a very large XSD. By large I mean 9700-plus XPaths and 65 or more objects (patient admin, lab tests, results, etc.) in a modified HL7 format. Many of these entities go down several levels and can hold multiple values. IBM says there is a 2000 node (?) limit that causes the parsing to be 'chunked'. The data will be sent via XML messages (never a full set) and needs to end up in an EDW. There will be about 350 lookups of 'anchor' data to be processed, as well as lookups to existing entities or lookup/create for new ones. As a DataStage developer I know I can't think in batch processing terms here, but using WTX to parse the messages and send them to different message queues, and then leveraging the ETL power of DataStage, seems like a plausible approach. We also want to maintain transactional integrity through some sort of transaction grouping, which I read about in the documentation but haven't figured out how to use yet. Are we biting off more than we can chew?
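
The "parse and send the messages to different message queues" step can be pictured with a small Python sketch. This is only a generic illustration of routing by message type, not WTX or DataStage code: the function name `route_messages` is hypothetical, in-memory `queue.Queue` objects stand in for a real message broker, and routing on the root element name is an assumption about how the message types might be told apart.

```python
import queue
import xml.etree.ElementTree as ET
from collections import defaultdict


def route_messages(messages):
    """Place each XML message on a queue named after its root element."""
    queues = defaultdict(queue.Queue)
    for text in messages:
        # Only the root tag is inspected; the payload is forwarded unchanged.
        root_tag = ET.fromstring(text).tag
        queues[root_tag].put(text)
    return queues
```

With one queue per entity type, each downstream job only has to handle the part of the schema that its own messages use.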

_________________
T. M.
eostic


Posted: Sun Aug 21, 2011 4:17 pm

From what you've told us thus far, I don't think so (re: are you biting off more than you can chew)....but more detail is needed....the size of the xsd will impact your "design work", but has nothing ...

tminelgin




Posted: Mon Aug 22, 2011 7:40 am

The XSD document itself is 248 KB, and the XML messages will be coming from the message broker. The messages will be arriving constantly from 2500 facilities, and I am sure there will be hundreds of thousands of 'instance' documents for each patient object being sent. I have been unable to save the entire XSD using the metadata importer and am now attempting to break it up into its components.

eostic


Posted: Mon Aug 22, 2011 11:53 am

Call your support provider...there is a patch beyond 8.5 fp1 to the Library Manager for handling really big xsd's.... (btw --- the new 8.5 stage does not use the xml metadata importer, but instead bri ...
