I have a requirement to extract data from an HTML string. Is there a good/easy way to achieve it using DataStage?
Any suggestion is appreciated.
Need suggestion to extract data from HTML string
Moderators: chulett, rschirm, roy
ABHIJIT DUTTA
Here is one example
<div class='container-fluid custPDPBucketContainer'><div class='row'><div class='col-md-12'><div class='row'><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-md-11'><span itemprop="sku">1445804</span></div></div><div class='row custPDPBucketHeader'><div class='col-md-12'>STONE DETAILS</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Minimum Carat Total Weight:</div><div class='col-xs-4 col-md-4'>1 1/8 ctw (1.11 - 1.19)</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Type:</div><div class='col-xs-4 col-md-4'>Diamond</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Stone Shape:</div><div class='col-xs-4 col-md-4'>Round</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Color:</div><div class='col-xs-4 col-md-4'>IJ</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Average Clarity:</div><div class='col-xs-4 col-md-4'>I3</div></div></div><div class='col-md-6'><div class='row custPDPBucketHeader'><div class='col-md-12'>METAL</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Type:</div><div class='col-xs-4 col-md-4'>Gold</div></div><div class='row custPDPBucketRow'><div class='col-md-1'></div><div class='col-xs-7 col-md-7'>Metal Color:</div><div class='col-xs-4 col-md-4'>Yellow</div></div></div></div></div></div></div>
From this HTML we need to be able to extract 'ITEM NUMBER', '1445804', 'STONE DETAILS',
'Minimum Carat Total Weight:', '1 1/8 ctw (1.11 - 1.19)', 'Stone Type:', and so on.
The first technical term that comes to mind is... yuck.
I don't see a good way to "manually" do this but perhaps others may have some suggestions. I would imagine you may need to leverage one of the many "HTML Parsers" out there or perhaps write something in C++ or Java. [shrug]
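As a rough illustration of the "HTML parser" route, here is a minimal sketch in Python using the standard library's `html.parser`. It assumes the HTML can be handed to a script outside DataStage (for example as a pre-processing step); the names `TextCollector` and `extract_texts` are made up for this example, and the sample string is a shortened piece of the posted HTML.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.texts.append(text)


def extract_texts(html):
    """Return the visible text fragments of an HTML string as a list."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Shortened fragment of the posted sample (an assumption for brevity).
sample = (
    "<div class='row custPDPBucketHeader'><div class='col-md-12'>ITEM NUMBER</div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div></div>"
    "<div class='row custPDPBucketRow'><div class='col-md-1'></div>"
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div></div>"
)

print(extract_texts(sample))
```

The result is a flat list of headers, labels and values in the order they appear, which would still need to be paired up (label followed by value) before loading into columns.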
-craig
"You can never have too many knives" -- Logan Nine Fingers
If the XML format is reliably identical for each record, it can be done with simple substring logic.
For example, if you can search for
<span itemprop="sku">
to find the item number, and that works for all records, it would be simple.
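The substring idea above might look roughly like this. It is only a sketch in Python to show the logic, not DataStage code; `text_after_marker` is a hypothetical helper name, and it assumes the value runs from the end of the marker up to the next "<".

```python
def text_after_marker(html, marker, end="<"):
    """Return the text between `marker` and the next `end`, or None if absent."""
    start = html.find(marker)
    if start == -1:
        return None  # marker not present in this record
    start += len(marker)
    stop = html.find(end, start)
    if stop == -1:
        return None  # no closing delimiter after the marker
    return html[start:stop].strip()


record = "<div class='col-md-11'><span itemprop=\"sku\">1445804</span></div>"
print(text_after_marker(record, '<span itemprop="sku">'))
```

The same two steps (locate the marker, then cut up to the next delimiter) should carry over to whatever index and substring functions are available in the job.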
If you can't, you have to parse the whole mess. DataStage has XML tools that can pull it apart into columns (the Hierarchical stage and the XML stages), if you have access to them and want to set that up. If not, Java or VB stages or a C routine are all options.
I always attack XML with string processing first. If dumb string matching does what I need, that is great. If not, I have to apply another method, and which one depends on how annoying the XML format is. The format does not have to be totally fixed for string processing to work. The tags you want just need to appear in a form you can find, such as "<tag>data"; it is fine if other tags are skipped or inserted. The trouble comes when the format is <tag><optional stuff or very deep nested junk>data AND the optional stuff is too complicated to reliably locate the data after it.
Ten minutes of analysis on the XML schema and example files should tell you whether string searching is even remotely possible. If not, it's a chore.
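To illustrate string matching that tolerates skipped or inserted tags, here is a small Python sketch using a regular expression. The helper name `value_after_label` is hypothetical. It assumes each label is followed, after any number of tags, by its value as plain text; if a value is empty, the pattern would pick up the next piece of text instead, which is exactly the kind of case the quick analysis should look for.

```python
import re


def value_after_label(html, label):
    """Return the first text that follows `label`, skipping any tags in between."""
    # label, optional whitespace, any run of tags, then text up to the next "<"
    pattern = re.escape(label) + r"\s*(?:<[^>]*>\s*)*([^<]+)"
    match = re.search(pattern, html)
    return match.group(1).strip() if match else None


row = (
    "<div class='col-xs-7 col-md-7'>Stone Type:</div>"
    "<div class='col-xs-4 col-md-4'>Diamond</div>"
)
# Same idea with extra nesting inserted around the value (made-up variation).
nested = "<div>Metal Color:</div><div><span><b>Yellow</b></span></div>"

print(value_after_label(row, "Stone Type:"))
print(value_after_label(nested, "Metal Color:"))
```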
Last edited by UCDI on Tue Apr 25, 2017 10:47 am, edited 2 times in total.