DSXchange: DataStage and IBM Websphere Data Integration Forum
anajitKS



Group memberships:
Premium Members

Joined: 18 Dec 2014
Posts: 28
Location: Kansas City
Points: 544

Posted: Mon Apr 24, 2017 4:04 pm

DataStage® Release: 9x
Job Type: Parallel
OS: Unix
Additional info: The HTML string does not have a fixed format.
I have a requirement to extract data from an HTML string. Is there a good or easy way to achieve this using DataStage?

Any suggestion is appreciated.

_________________
ABHIJIT DUTTA
chulett

Premium Poster


since January 2006

Group memberships:
Premium Members, Inner Circle, Server to Parallel Transition Group

Joined: 12 Nov 2002
Posts: 41975
Location: Denver, CO
Points: 215434

Posted: Mon Apr 24, 2017 6:57 pm

Seems to me the first answer is "depends". Can you post an example of the HTML and what data you are trying to extract from it, please?

_________________
-craig

<this space for rent>
anajitKS




Posted: Tue Apr 25, 2017 8:12 am

Here is one example:

<div class='container-fluid custPDPBucketContainer'><div class='row'><div class='col-md-12'><div class='row'><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'> ITEM NUMBER </div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-md-11'><span itemprop="sku"> 1445804 </span></div></div><div class='row custPDPBucketHeader'><div class='col-md-12'> STONE DETAILS </div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'> Minimum Carat Total Weight: </div><div class='col-xs-4 col-md-4'> 1 1/8 ctw (1.11 - 1.19) </div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'> Stone Type: </div><div class='col-xs-4 col-md-4'> Diamond </div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Shape:</div><div class='col-xs-4 col-md-4'>Round</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Color:</div><div class='col-xs-4 col-md-4'>IJ</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Clarity:</div><div class='col-xs-4 col-md-4'>I3</div></div></div><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>METAL</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Type:</div><div class='col-xs-4 col-md-4'>Gold</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Color:</div><div class='col-xs-4 col-md-4'>Yellow</div></div></div></div></div></div></div>

From this HTML we have to be able to extract 'ITEM NUMBER', '1445804', 'STONE DETAILS', 'Minimum Carat Total Weight:', '1 1/8 ctw (1.11 - 1.19)', 'Stone Type:', and so on.

_________________
ABHIJIT DUTTA
chulett


Posted: Tue Apr 25, 2017 9:56 am

The first technical term that comes to mind is... yuck.

I don't see a good way to "manually" do this but perhaps others may have some suggestions. I would imagine you may need to leverage one of the many "HTML Parsers" out there or perhaps write something in C++ or Java. [shrug]
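To make the "HTML parser" idea concrete, here is a minimal sketch in Python, assuming the parsing can run outside DataStage (for example as a pre-processing script). It uses the standard library's html.parser; the names TextCollector and extract_texts are made up for illustration, and the sample is a shortened fragment in the same shape as the one posted above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-blank text node in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for the text found between tags; ignore pure whitespace.
        text = data.strip()
        if text:
            self.texts.append(text)


def extract_texts(html):
    """Hypothetical helper: return the text nodes of an HTML string as a list."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Shortened fragment shaped like the posted sample.
sample = (
    "<div class='row custPDPBucketHeader'><div class='col-md-12'> ITEM NUMBER </div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\"> 1445804 </span></div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'> Stone Type: </div>"
    "<div class='col-xs-4 col-md-4'> Diamond </div></div>"
)

print(extract_texts(sample))
```

This only flattens the markup into an ordered list of strings; pairing headers, labels, and values would still be up to whatever consumes that list.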

_________________
-craig

<this space for rent>
UCDI



Group memberships:
Premium Members

Joined: 21 Mar 2016
Posts: 218

Points: 2266

Posted: Tue Apr 25, 2017 10:43 am

If the XML format is reliably identical for each record, it can be done with simple substring logic.

For example, if you can search for
<span itemprop="sku">
to find the item number, and that works for all records, that would be simple.
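As a rough illustration of that find-then-slice idea, here is a small Python sketch. It is not DataStage code; the function name text_after_marker is invented for the example, and it assumes the value is plain text that ends at the next "<".

```python
def text_after_marker(html, marker, terminator="<"):
    """Return the text between `marker` and the next `terminator`, or None."""
    start = html.find(marker)
    if start == -1:
        return None  # marker not present in this record
    start += len(marker)
    end = html.find(terminator, start)
    if end == -1:
        return None  # no closing tag follows the value
    return html[start:end].strip()


# One record, cut down from the posted sample.
record = "<div class='col-md-11'><span itemprop=\"sku\"> 1445804 </span></div>"
sku = text_after_marker(record, '<span itemprop="sku">')
print(sku)
```

The same two-step search (locate the marker, then cut up to the next tag) is what you would need to reproduce with string functions in the job itself.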

If you can't, you have to parse the whole mess. DataStage has XML tools that can pull it apart into columns (the hierarchical stage and the XML stages), if you have access to them and want to try setting that up. If not, Java, VB stages, or a C routine are all options.

I always attack XML with string processing first. If I can do what I need with dumb string matching, that is great. If not, I have to apply another method, and which one depends on how annoying the XML format is. The format does not have to be totally fixed for string processing to work. It just needs the tags you want in a form you can find, such as "<tag>data"; it is fine if other tags are skipped or inserted. The trouble comes when you have a "<tag><optional stuff or very deeply nested junk>data" format AND the optional stuff is too complicated to reliably locate the data after it.
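Here is one way the "find the label, skip whatever tags follow, take the next text" approach might look, sketched in Python with a regular expression. The helper name value_for_label is hypothetical, and the pattern assumes no attribute value contains a ">" character.

```python
import re


def value_for_label(html, label):
    """Return the first text that follows `label`, skipping any tags in between."""
    pattern = re.compile(
        re.escape(label)          # the literal label, e.g. "Stone Type:"
        + r"\s*(?:<[^>]*>\s*)+"   # one or more tags of any kind after the label
        + r"([^<]+)"              # the next run of text, up to the following tag
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


# Shortened fragment shaped like the posted sample.
row = (
    "<div class='col-md-12'> ITEM NUMBER </div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\"> 1445804 </span></div></div>"
    "<div class='col-xs-7 col-md-7'> Stone Type: </div>"
    "<div class='col-xs-4 col-md-4'> Diamond </div>"
)

print(value_for_label(row, "Stone Type:"))
print(value_for_label(row, "ITEM NUMBER"))
```

Because the pattern does not care which tags sit between the label and the value, extra or missing wrapper divs do not break it. It does break if the "value" can itself be empty or contain nested markup, which is the complicated case described above.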

Ten minutes of analysis of the XML schema and example files should tell you whether string searching is even remotely possible. If not, it's a chore.


Last edited by UCDI on Tue Apr 25, 2017 10:47 am; edited 2 times in total
anajitKS




Posted: Tue Apr 25, 2017 10:45 am

chulett wrote:
The first technical term that comes to mind is... yuck.

I had the same reaction when it came up as a requirement. I just wanted to find out if anyone has any suggestions.

_________________
ABHIJIT DUTTA
chulett


Posted: Tue Apr 25, 2017 11:06 am

Of course, and you have a couple now.

How much does it matter that it isn't really XML but rather HTML? I was wondering if you could parse it as XML, but you would need to make it "well formed" beforehand, I would think.
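A quick way to check whether a given fragment would survive an XML parser is to try it. The sketch below uses Python's xml.etree.ElementTree purely as a stand-in for any strict XML parser; texts_via_xml is a made-up helper name, and both inputs are small hand-written fragments, not real data.

```python
import xml.etree.ElementTree as ET


def texts_via_xml(fragment):
    """Parse a well-formed fragment as XML and return its non-blank text nodes."""
    root = ET.fromstring(fragment)  # raises ET.ParseError if not well formed
    return [t.strip() for t in root.itertext() if t.strip()]


# Shaped like the posted sample: one root element, every tag closed, attributes quoted.
good = (
    "<div class='row'><div class='col-xs-7 col-md-7'>Metal Type:</div>"
    "<div class='col-xs-4 col-md-4'>Gold</div></div>"
)
print(texts_via_xml(good))

# Typical HTML that is not XML: an unclosed <br> makes the parse fail.
bad = "<div>Metal Type:<br>Gold</div>"
try:
    texts_via_xml(bad)
    parsed_bad = True
except ET.ParseError:
    parsed_bad = False
print(parsed_bad)
```

If every incoming string looks like the first fragment, treating it as XML may be workable. If unclosed tags or HTML-only entities such as &nbsp; show up, a clean-up step would be needed first.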

_________________
-craig

<this space for rent>
All times are GMT - 6 Hours