Hierarchical Stage: sort join issue with parent-child switch

kurics40
Premium Member
Posts: 61
Joined: Wed Nov 18, 2009 10:01 am

Post by kurics40 »

Hi,

I have been given a complex XSD. It was probably assembled by copy and paste, because every node and leaf is both optional and unbounded, so the XSD has no fixed anchor point. Every leaf has a wrapper node without any attributes. These have minOccurs="0" and maxOccurs="unbounded" too. Every attribute has minOccurs="0".


I am trying to flatten the input XML. I use switches to separate the groups and sort join to join the nodes back together. This works for the switches on the first-level nodes. The problem appears when I join a switch that is a child of another switch. I tried many variations with a small example, but sooner or later the parallel job fails. I somehow managed to join a small, fixed main node to a 1-N leaf, but when I try to do the same with another leaf from the same group, it fails.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Usually best to include an actual question in your post. Perhaps even post your XSD.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kurics40
Premium Member
Posts: 61
Joined: Wed Nov 18, 2009 10:01 am

Post by kurics40 »

I get many different kinds of errors: NullPointer, Invalid parent cursor, etc.

Question: How can I make this work? Is this a bug?

I create switches for the groups and attempt to order join them.

switches:
IS
ICs
IC
IDs
ID

Switch    Scope
------    -----
IS        input link
ICs       IS
IC        ICs
IDs       IS
ID        IDs



Order Join:
Join_IS: IS - ICs
Join_ICs: Join_IS - IC
Join_IC : Join_ICs - IC


Code: Select all

    <xs:element name="ID">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="ISId" type="xs:string" minOccurs="0"/>
                <xs:element name="GU" type="xs:string" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="IDs">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="ID" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="IC">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="ICL" type="xs:string" minOccurs="0"/>
                <xs:element name="ICL2" type="xs:string" minOccurs="0"/>
                <xs:element name="NAME" type="xs:string" minOccurs="0"/>
                <xs:element name="PURPOSE" type="xs:string" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="ICs">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="IC" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="IS">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="ISName" type="xs:string" minOccurs="0"/>
                <xs:element ref="IDs"/>
                <xs:element ref="ICs"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="ISs">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="IS" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

What is your ultimate goal?

Are you writing XML that is coming from many different relational tables? Or parsing/reading incoming XML and putting together a result?

Also, spend time reviewing an actual document itself. If everything is entirely unbounded, it often means that "it was the simplest way to quickly generate a necessary xsd" and does not reflect reality. This happens a lot. I've seen scenarios where "date of birth" for a person occurs unbounded in its own node (maybe there are situations, but not particularly likely). Make certain that it follows a sensible use case.

Joins and regroups are perhaps the hardest part of using the Stage. One thing that can easily go wrong, and lead to the strangest of errors, is if the "nested" nature of your join specifications, your switches, etc. is off: perhaps referencing a grandparent instead of the proper parent, or a sibling, or anything else. Review the various key specifications VERY carefully as you go through it.

Further, when I am building something with many Joins, especially when they are iterative and go deep, I create a separate Job for each Join. Get one working perfectly, then save that Job and copy it to another one. Start adding the next Join in the assembly, and so on, until I have what is required (this is assuming that you are WRITING the xml).

If you are reading the xml, the best practice is "parse and get out". Meaning: don't do ANY joining or aggregation inside the Assembly. Do that downstream in DataStage, where there are dedicated Stages for each of those very important and highly performance-impactful processes.
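
To make "parse and get out" concrete outside of DataStage, here is a minimal Python sketch, not the Stage's own mechanism. It parses each repeating group of the posted XSD (IS, ID, IC) into its own flat row set and only then joins them downstream. The sample document, the generated `is_key` column, and the helper names `parse_out` and `join_on_key` are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the XSD posted earlier in the thread.
DOC = """
<ISs>
  <IS>
    <ISName>alpha</ISName>
    <IDs><ID><ISId>1</ISId><GU>g1</GU></ID></IDs>
    <ICs>
      <IC><NAME>c1</NAME></IC>
      <IC><NAME>c2</NAME></IC>
    </ICs>
  </IS>
</ISs>
"""

def parse_out(xml_text):
    """Parse only: one flat row list per repeating group, no joining."""
    root = ET.fromstring(xml_text)
    is_rows, id_rows, ic_rows = [], [], []
    # is_key is a generated parent key (an assumption, since ISName is optional).
    for is_key, is_node in enumerate(root.findall("IS"), start=1):
        is_rows.append({"is_key": is_key, "ISName": is_node.findtext("ISName")})
        for id_node in is_node.findall("IDs/ID"):
            id_rows.append({"is_key": is_key,
                            "ISId": id_node.findtext("ISId"),
                            "GU": id_node.findtext("GU")})
        for ic_node in is_node.findall("ICs/IC"):
            ic_rows.append({"is_key": is_key,
                            "NAME": ic_node.findtext("NAME")})
    return is_rows, id_rows, ic_rows

def join_on_key(parents, children):
    """Downstream left outer join on is_key; childless parents are kept."""
    by_key = {}
    for child in children:
        by_key.setdefault(child["is_key"], []).append(child)
    joined = []
    for parent in parents:
        for child in by_key.get(parent["is_key"], [{}]):
            joined.append({**parent, **child})
    return joined

is_rows, id_rows, ic_rows = parse_out(DOC)
for row in join_on_key(is_rows, ic_rows):
    print(row)
```

The point of the split is that the parsing step never has to know how the groups relate; the join is an ordinary keyed join that any downstream tool can do.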

Ernie
Ernie Ostic

blogit!
Open IGC is Here! https://dsrealtime.wordpress.com/2015/0 ... ere/
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

eostic wrote: I've seen scenarios where "date of birth" for a person occurs unbounded in its own node (maybe there are situations, but not particularly likely).
My sister just turned 37 for the 30th year in a row. :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

;)

Good one!
kurics40
Premium Member
Posts: 61
Joined: Wed Nov 18, 2009 10:01 am

Post by kurics40 »

My ultimate goal: flatten the XML data into multiple records and load them into one staging table. The current solution runs 24 loads at the same time.

XML file sample:

Code: Select all

<employee>
      <name>Joe</name>
      <telephones>
            <telephone>11111111</telephone>
            <telephone>22222222</telephone>
      </telephones>
</employee>
Desired records:
Joe|11111111
Joe|22222222
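
For comparison, the flattening described above can be sketched in a few lines of Python with the standard library. This is only an illustration under assumptions: the `<employees>` wrapper element, the second employee "Ann", and the function name `flatten_employees` are invented here, and an employee without telephones is emitted with an empty telephone field.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the employee sample wrapped in an assumed root element,
# plus one employee without telephones to show the optional case.
SAMPLE = """
<employees>
  <employee>
    <name>Joe</name>
    <telephones>
      <telephone>11111111</telephone>
      <telephone>22222222</telephone>
    </telephones>
  </employee>
  <employee>
    <name>Ann</name>
  </employee>
</employees>
"""

def flatten_employees(xml_text):
    """Return one 'name|telephone' record per telephone.

    An employee with no telephones still yields one record with an empty
    telephone field (an assumption; drop that case if such rows are unwanted).
    """
    root = ET.fromstring(xml_text)
    records = []
    for employee in root.iter("employee"):
        name = employee.findtext("name", default="")
        phones = [t.text or "" for t in employee.findall("telephones/telephone")]
        for phone in phones or [""]:
            records.append(f"{name}|{phone}")
    return records

records = flatten_employees(SAMPLE)
for record in records:
    print(record)
```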

I would like to have one staging table load instead of 24. The data has 24 subclasses.

Let's say I have two lists.
list1: row1,row2,row3
list2: A,B

Order join:
row1,A
row2,B
row3


instead of a real join like:

row1,A
row2,A
row3,A
row1,B
row2,B
row3,B
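
The difference between the two results can be sketched in Python. This only mirrors the behaviour described in this post (pairing by position versus pairing every row with every row); it is not the Hierarchical Stage's implementation, and the function names `order_join` and `cross_join` are made up for the example.

```python
from itertools import product, zip_longest

list1 = ["row1", "row2", "row3"]
list2 = ["A", "B"]

def order_join(left, right):
    """Pair items by position; the shorter side is padded with None."""
    return list(zip_longest(left, right))

def cross_join(left, right):
    """Pair every left item with every right item (right side varies slowest)."""
    return [(l, r) for r, l in product(right, left)]

positional = order_join(list1, list2)
cartesian = cross_join(list1, list2)

print(positional)  # three pairs, the last one has no partner
print(cartesian)   # six pairs
```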



Why are you saying that join shouldn't be used inside the Hierarchical Stage? Isn't it reliable?
Or do you think it is just easy to get wrong?


PS: the task is given. Not my idea.
kurics40
Premium Member
Posts: 61
Joined: Wed Nov 18, 2009 10:01 am

Post by kurics40 »

Below is the error message I get when I try to sort join two nested switches with each other. The structure is:

Code: Select all

group1main minOccurs="0" maxOccurs="unbounded"
   group1  minOccurs="0" maxOccurs="unbounded"
       subgroupmain minOccurs="0" maxOccurs="unbounded"
           subgroup1 minOccurs="0" maxOccurs="unbounded"
Only group1 and subgroup1 have attributes.
Switches:
  • group1
    subgroup1
I tried to order join group1 with subgroup1. I only have a problem with the order join of nested switches.

Error message:
Hierarchical_Data,0: Fatal Error: 2019-05-02 21:02:24,006 Fatal [Projection-220] [] Invalid parent cursor
com.ibm.e2.core.exceptions.E2IllegalStateException: MissingStartItemCallOnParent frame = 'ProjectRuntime:Op-RT-2[Projection-220]', cursorPath = '{yyy.generated.interfaces.xxxxxx.com}subgroup1'
at com.ibm.e2.core.exceptions.E2IllegalStateException$FactoryImpl.missingStartItemCallOnParent(E2IllegalStateException$FactoryImpl.java:76)
at com.ibm.e2.core.framework.runtime.daapi.WriteCursorImpl.startItem(WriteCursorImpl.java:163)
at com.ibm.e2.core.framework.runtime.generic.traversers.StartItemHandler.startItemsForCursors(StartItemHandler.java:94)
at com.ibm.e2.core.framework.runtime.generic.traversers.StartItemHandler.itemBegin(StartItemHandler.java:62)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEventForHandler(AbstractTraverser.java:668)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEvent(AbstractTraverser.java:534)
at com.ibm.e2.core.framework.runtime.generic.traversers.VectorTraverser.handleCurrentState(VectorTraverser.java:133)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.continueTraversal(AbstractTraverser.java:475)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.startTraversal(AbstractTraverser.java:463)
at com.ibm.e2.core.framework.runtime.generic.traversers.VectorTraverser.itemBegin(VectorTraverser.java:196)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEventForHandler(AbstractTraverser.java:668)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEvent(AbstractTraverser.java:534)
at com.ibm.e2.core.framework.runtime.generic.traversers.ItemTraverser.handleCurrentState(ItemTraverser.java:80)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.continueTraversal(AbstractTraverser.java:475)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.startTraversal(AbstractTraverser.java:463)
at com.ibm.e2.core.framework.runtime.generic.traversers.ItemTraverser.itemBegin(ItemTraverser.java:122)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEventForHandler(AbstractTraverser.java:668)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEvent(AbstractTraverser.java:534)
at com.ibm.e2.core.framework.runtime.generic.traversers.VectorTraverser.handleCurrentState(VectorTraverser.java:133)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.continueTraversal(AbstractTraverser.java:475)
at com.ibm.e2.core.framework.runtime.generic.traversers.VectorTraverser.itemBegin(VectorTraverser.java:192)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEventForHandler(AbstractTraverser.java:668)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEvent(AbstractTraverser.java:534)
at com.ibm.e2.core.framework.runtime.generic.traversers.ItemTraverser.handleCurrentState(ItemTraverser.java:80)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.continueTraversal(AbstractTraverser.java:475)
at com.ibm.e2.core.framework.runtime.generic.traversers.ItemTraverser.itemBegin(ItemTraverser.java:118)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEventForHandler(AbstractTraverser.java:668)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.handleEvent(AbstractTraverser.java:534)
at com.ibm.e2.core.framework.runtime.generic.traversers.VectorTraverser.handleCurrentState(VectorTraverser.java:133)
at com.ibm.e2.core.framework.runtime.generic.traversers.AbstractTraverser.continueTraversal(AbstractTraverser.java:475)
at com.ibm.e2.core.framework.frames.AbstractRuntimeFrame.runTraverser(AbstractRuntimeFrame.java:1458)
at com.ibm.e2.core.framework.frames.AbstractRuntimeFrame.runTraverser(AbstractRuntimeFrame.java:1437)
at com.ibm.e2.core.framework.frames.ProjectRuntime.process(ProjectRuntime.java:388)
at com.ibm.e2.core.framework.runtime.OperatorController.callOperatorProcess(OperatorController.java:341)
at com.ibm.e2.core.framework.runtime.OperatorController.runOperator(OperatorController.java:273)
at com.ibm.e2.core.framework.runtime.OperatorController.doReadyToExecute(OperatorController.java:177)
at com.ibm.e2.core.framework.runtime.OperatorController.runDataStateTransistion(OperatorController.java:132)
at com.ibm.e2.core.framework.runtime.OperatorController.runTransitions(OperatorController.java:88)
at com.ibm.e2.core.framework.runtime.OperatorController.runOperatorStep(OperatorController.java:66)
at com.ibm.e2.core.framework.runtime.scheduler.OperatorTask.runFrame(OperatorTask.java:86)
at com.ibm.e2.core.framework.runtime.scheduler.OperatorTask.execute(OperatorTask.java:40)
at com.ibm.e2.core.framework.runtime.scheduler.AbstractTask.run(AbstractTask.java:29)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
kurics40
Premium Member
Posts: 61
Joined: Wed Nov 18, 2009 10:01 am

Post by kurics40 »

Any wise advice is welcome. Premium or regular.

I am open to using something lighter than a heavy enterprise product, such as Python, to make this work.

Microsoft or IBM....
same
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

A couple of things...

1. The first scenario, name and telephone: they can easily be parsed and retrieved all on one single output link, exactly as you have it. You only need a single link for all directly nested nodes. So in your case, it doesn't sound like you have 24 complete "paths" through the document. In this case, are you using two output links, one for name and one for telephone?

2. You "might" need 24 output links "if" you expect that there might be names (in the use case above) without ANY telephone number instances (depends on the document and the kind of relationships you have). In contrast, suppose you had a document with 24 nodes, all from the top, in one giant nested path, all populated: original node, children, grandchildren, great-grandchildren, etc. They would all be retrieved just fine on one link. No joins.

3. Are you requiring deep detailed xsd based validation?

4. How large are the documents? Multiple terabytes? Several KB each? Somewhere in the middle? (I'm talking here about each "one" document.) And in a particular run, how many do you have in total volume? Terabytes worth? Gigabytes worth? Hundreds of MB?

This last question (#4) is very important regarding your approach.

As for doing the Join outside: ease of support and use. As powerful as the Hierarchical Stage is, it is not as widely known. There are TONS of things written and documented about best practices for designing and supporting the JOIN stage and related Stages for bringing datasets together. "Parse and get out" is basically saying: keep the transformation activity graphical and well documented, where the next DataStage user can come along and immediately know what is going on, support and maintain the Job, and make edits or performance changes as needed. The "join" capability was primarily built for "writing" complex xml, when you are bringing "in" many relationally based links and need to combine them to create a nested hierarchical whole. Best practice is to avoid it for parsing. It is not needed and just complicates the entire process.

Ernie