IA data rule to check for valid XML

This forum contains ProfileStage posts and now focuses on newer versions of InfoSphere Information Analyzer.

Moderators: chulett, rschirm

qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

IA data rule to check for valid XML

Post by qt_ky »

Any ideas on how to build an IA data rule to check if a column contains valid XML (properly constructed)? I did not see anything related to XML in the pre-built rule definitions.
Choose a job you love, and you will never have to work a day in your life. - Confucius
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Theoretically.....in Java? Assuming you can invoke either a well-formedness checker or a schema validator.

....or else use a DataStage job itself, with an Exceptions stage downstream from a Hierarchical stage.....not exactly the same, but it may accomplish the task.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I'm not really a big XML fan or a Java developer; just looking for the least-effort path to validate XML that is stored in a source database column (without any XSD). The data source is SQL Server, to which we have read-only access via ODBC on Information Server.

What are the options on AIX?

1. I did some searching and found an Apache Xerces-C++ validating XML parser. http://xerces.apache.org/xerces-c/index.html

2. We do have DataStage. I have not done a lot with XML in DataStage. Does it require an XSD or can it validate XML by itself?

3. I assume most relational database systems have XML data type support. Could we attempt to load the source data into a temporary table in the IADB database (DB2 10.5)? Is that a legitimate use of the bundled DB2?

4. Other options?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's been a long time, but from what I recall it needs the XSD in order to have a clue how to "validate" the XML. It might depend on what that means to the OP: simply checking whether the XML is well-formed doesn't need an XSD, but that check has to happen before anything tries to process the document or the parser will fail. Once you are sure it is well-formed, the gory details of the elements can be validated against the XSD by an ETL tool.
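To illustrate the distinction drawn above, here is a minimal sketch of a well-formedness-only check that needs no XSD. It is written in Python with the standard library's `xml.etree.ElementTree` rather than any Information Server component, and the function name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; no XSD is consulted."""
    if not text:
        # NULL or empty column values are treated as not well-formed here
        # (an assumption; a real rule might want to treat them separately).
        return False
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

A result of True only says the parser could read the document. It says nothing about whether the elements, attributes, or values are the ones you expect; that would still need an XSD or other rules.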

I have vague recollections of using Xerces (no, not the king of Persia!) when we had to handle large (as in hundreds of megabytes) XML files for Google, or maybe it was something from the Java Beans collection? Way too long ago, but I remember it had to be a stream-based tool, as the files we were processing were "too large" to load all up in memory. Much preferred some of our other sources, who gave us a crap-ton of teeny little files.
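For the "too large to load in memory" case, a stream-based check can feed the parser in chunks. The sketch below uses Python's standard `xml.parsers.expat` module, not Xerces, purely to show the idea; `stream_is_well_formed` and the `chunk_size` default are assumptions made for this example.

```python
import io
import xml.parsers.expat


def stream_is_well_formed(fileobj, chunk_size=65536):
    """Check well-formedness of a binary stream without loading it whole."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            parser.Parse(chunk, False)  # more data may follow
        parser.Parse(b"", True)  # signal end of document
    except xml.parsers.expat.ExpatError:
        return False
    return True


def bytes_are_well_formed(data, chunk_size=65536):
    """Convenience wrapper for data that is already in memory."""
    return stream_is_well_formed(io.BytesIO(data), chunk_size)
```

With a real file this could be called as `stream_is_well_formed(open(path, "rb"))`, where `path` is whatever file you are testing.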

Not an XML fan either, but I work with it when forced to. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

I would also prefer not to use Java if I can avoid it.

You may be successful with the xmlInput stage. It doesn't need an XSD, and it performs a Xerces-based well-formedness check when getting started.

You will have to experiment a little to see what its validator will capture, or not, if only checking for well-formedness.

...and use a server job. It would be perfect for this. No issues with data lengths and the Folder Stage was made for this....

Folder....to xmlInput to......(initially) Sequential. Load the built-in Folder tabledef and point the stage to your xml subdir and then the actual file. You can change to a wildcard later.

Point the xmlInput stage to the file contents column at the input link, and choose xml content on the radio button....

On the output link......not sure what is best....probably a longvarchar, single column, with something like /<rootElementname>/ in the Description property of that single column. Even just a slash might work.

.....get that to work with a GOOD document. It doesn't matter what it delivers...just that it sends out one row.

Now look at the validation options in the stage and also the reject link possibilities. Add a link and name it in the pull-down here. Choose something non fatal.

Make sure your good xml still works and then start feeding it various kinds of bad xml....

There's more, but that should get you close.
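Outside DataStage, the same pattern of passing good documents through and routing bad ones to a reject output can be sketched in a few lines of Python. This is not how the xmlInput stage works internally; it is only an illustration of the good/reject split, and `split_rows` and the `(key, xml_text)` row shape are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def split_rows(rows):
    """Split (key, xml_text) rows into good keys and (key, error) rejects."""
    good, rejects = [], []
    for key, xml_text in rows:
        try:
            # None (a NULL column value) is parsed as an empty string,
            # which the parser rejects.
            ET.fromstring(xml_text or "")
        except ET.ParseError as exc:
            # Keep the parser message so the reject output explains itself.
            rejects.append((key, str(exc)))
        else:
            good.append(key)
    return good, rejects
```

As in the job design above, it is worth confirming first that a known-good document lands in the good list, and only then feeding in various kinds of bad XML to see what ends up in the rejects.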

Ernie
Ernie Ostic

eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

One very important key here is to know in advance what you are looking for in terms of validation.

You mentioned that there is no XSD..... formal XML "validation" means determining whether a document matches its XSD design.

So.....do you need checking just to see if the xml is readable xml? ...that it has the correct elements and attributes? That they are the right length or size or value?
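The three levels in that question (readable XML, correct elements, acceptable values) can be made concrete with a small sketch. It is Python with hand-written rules instead of an XSD, and everything in it (`classify`, `root_tag`, `required`, `max_len`) is hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def classify(xml_text, root_tag, required, max_len):
    """Report the first level of checking that a document fails."""
    # Level 1: is it readable (well-formed) XML at all?
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well-formed"
    # Level 2: does it have the expected root and child elements?
    if root.tag != root_tag:
        return "wrong structure"
    if any(root.find(name) is None for name in required):
        return "wrong structure"
    # Level 3: are the values within the allowed length?
    for name in required:
        if len(root.find(name).text or "") > max_len:
            return "bad value"
    return "ok"
```

Only the first level can be done without knowing anything about the documents. The other two need rules from somewhere, which is what an XSD would normally supply.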

That will be important in determining further what the solution will look like, and how deep you get into the solution above.

Ernie
Ernie Ostic

qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Thank you so much for that outline--big time saver! I will follow that and see how far I can get.

I am sticking with just the surface level initially (what you called well-formed-ness) then will see where it goes with the customer.