Need suggestion to extract data from HTML string

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

anajitKS
Premium Member
Posts: 28
Joined: Thu Dec 18, 2014 7:57 pm
Location: Kansas City

Post by anajitKS »

I have a requirement to extract data from an HTML string. Is there a good/easy way to achieve it using DataStage?

Any suggestion is appreciated.
ABHIJIT DUTTA
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Seems to me the first answer is "depends". Can you post an example of the HTML and what data you are trying to extract from it, please?
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by anajitKS »

Here is one example

<div class='container-fluid custPDPBucketContainer'><div class='row'><div class='col-md-12'><div class='row'><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-md-11'><span itemprop="sku">1445804</span></div></div><div class='row custPDPBucketHeader'><div class='col-md-12'>STONE DETAILS</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Minimum Carat Total Weight:</div><div class='col-xs-4 col-md-4'>1 1/8 ctw (1.11 - 1.19)</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Type:</div><div class='col-xs-4 col-md-4'>Diamond</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Shape:</div><div class='col-xs-4 col-md-4'>Round</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Color:</div><div class='col-xs-4 col-md-4'>IJ</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Clarity:</div><div class='col-xs-4 col-md-4'>I3</div></div></div><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>METAL</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Type:</div><div class='col-xs-4 col-md-4'>Gold</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Color:</div><div class='col-xs-4 col-md-4'>Yellow</div></div></div></div></div></div></div>

From this HTML we need to extract 'ITEM NUMBER', '1445804', 'STONE DETAILS', 'Minimum Carat Total Weight:', '1 1/8 ctw (1.11 - 1.19)', 'Stone Type:', and so on.

Post by chulett »

The first technical term that comes to mind is... yuck. :?

I don't see a good way to "manually" do this but perhaps others may have some suggestions. I would imagine you may need to leverage one of the many "HTML Parsers" out there or perhaps write something in C++ or Java. [shrug]
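As a rough illustration of the "HTML parser" route, here is a minimal sketch in Python using only the standard library's html.parser module. Python is an assumption here (the thread only mentions C++ or Java), and the class and function names (TextCollector, extract_texts) are made up for the example. It simply collects every non-empty piece of text in document order, which is the list of values the poster asked for.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-empty text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.texts.append(text)


def extract_texts(html):
    """Return the visible text fragments of an HTML string as a list."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Shortened version of the sample posted above.
sample = (
    "<div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div></div>"
)

print(extract_texts(sample))
```

Something like this would have to run outside DataStage, or be called from it, to produce delimited output that a job can read. How best to wire that in depends on the environment.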
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

If the XML format is reliably identical for each record, it can be done with simple substring logic.

For example, if you could seek
<span itemprop="sku">
to find the item number, and you can do it for all records, that would be simple.
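To make the "seek the marker" idea concrete, here is a small sketch in Python (an assumed language, chosen only for illustration; the same logic could be written with string functions in a Transformer). The function name text_after_marker is hypothetical. It finds the marker, then reads up to the next "<".

```python
def text_after_marker(html, marker, terminator="<"):
    """Return the text between `marker` and the next `terminator`, or None."""
    start = html.find(marker)
    if start == -1:
        return None  # marker not present in this record
    start += len(marker)
    end = html.find(terminator, start)
    if end == -1:
        return None  # no closing tag after the marker
    return html[start:end].strip()


# One record, cut down from the sample posted earlier in the thread.
record = "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div>"

sku = text_after_marker(record, '<span itemprop="sku">')
print(sku)  # prints 1445804
```

This only works if the marker text is byte-for-byte identical in every record, including the quote style around the attribute value.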

If you can't, you have to parse the whole mess. DataStage has XML tools that can pull it apart into columns (the Hierarchical stage and the XML stages), if you have access to those and want to try setting that up. If not, Java, VB stages, or a C routine are all options.

I always attack XML with string processing first. If I can do what I need with dumb string matching, that is great. If not, I have to apply another method, and which one depends on how annoying the XML format is. The format does not have to be totally fixed for string processing to work. It just needs to have the tags you want in a form you can find, "<tag>data"; if other tags are skipped or inserted, that is OK. The trouble comes when the format is <tag><optional stuff or very deeply nested junk>data AND the optional stuff is too complicated to reliably locate the data after it.
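As a sketch of what that kind of pattern matching might look like for the label/value rows in the posted sample, here is a Python example using a regular expression. The language, the PAIR_RE pattern, and the extract_pairs name are all assumptions for illustration. It relies on the label and value divs always carrying exactly the class strings shown in the sample, and it ignores any other tags around them.

```python
import re

# Assumes each label div is immediately followed by its value div and that
# the class attributes are written exactly as in the posted sample.
PAIR_RE = re.compile(
    r"<div class='col-xs-7 col-md-7'>([^<]*)</div>\s*"
    r"<div class='col-xs-4 col-md-4'>([^<]*)</div>"
)


def extract_pairs(html):
    """Return {label: value} for every label/value row found in the string."""
    return {
        label.strip().rstrip(":"): value.strip()
        for label, value in PAIR_RE.findall(html)
    }


# Two rows cut down from the sample posted earlier in the thread.
rows = (
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Metal Color:</div>"
    "<div class='col-xs-4 col-md-4'>Yellow</div></div>"
)

print(extract_pairs(rows))
```

If a value ever contains nested markup (a link or a span, say), the [^<]* part stops matching and that row is silently skipped, which is the "optional stuff" problem described above.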

Ten minutes of analysis on the XML schema and example files should tell you whether string searching is even remotely possible. If not, it's a chore.
Last edited by UCDI on Tue Apr 25, 2017 10:47 am, edited 2 times in total.

Post by anajitKS »

chulett wrote: The first technical term that comes to mind is... yuck. :?
I had the same reaction when it came up as a requirement. I just wanted to find out if anyone has any suggestions.

Post by chulett »

Of course, and you have a couple now.

How much does it matter that it isn't really XML but rather HTML? I was wondering if you could parse it as XML, but I would think you would need to make it "well formed" beforehand.
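One quick way to check whether a given fragment already is well formed is to hand it to a strict XML parser and see if it complains. Below is a sketch in Python with the standard xml.etree.ElementTree module; the language choice and the function name texts_if_well_formed are assumptions made for the example. The second input is an invented fragment with an unclosed <br> tag, the sort of thing HTML tolerates and XML does not.

```python
import xml.etree.ElementTree as ET


def texts_if_well_formed(fragment):
    """Return the text nodes if `fragment` parses as XML, otherwise None."""
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError:
        return None  # not well formed; would need cleaning up first
    return [t.strip() for t in root.itertext() if t.strip()]


# Cut down from the posted sample: every tag is closed, attributes are quoted.
well_formed = (
    "<div class='row'><div class='col-md-12'>METAL</div>"
    "<div class='col-xs-7 col-md-7'>Metal Type:</div>"
    "<div class='col-xs-4 col-md-4'>Gold</div></div>"
)

# Invented example: legal HTML, but the unclosed <br> makes it invalid XML.
not_well_formed = "<div class='row'>Metal Type:<br>Gold</div>"

print(texts_if_well_formed(well_formed))
print(texts_if_well_formed(not_well_formed))
```

Running a check like this over a representative batch of records would show whether the XML route is viable as-is or needs a cleanup step first.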