Brain Teaser...Complex Scneario

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Brain Teaser...Complex Scneario

Post by Nageshsunkoji »

Hi DSXians,

I came up with a Brain Teaser on Datastage. This problem is bugging us, hope the Datastage world will help me.

The Scenario is, my source data have a constrain on the Length of the String Field (10), if it crosses that limit, it is storing in a new row with the same primary keys by adding a Sequence number it. See the example below:

SOURCE_D KEY1 KEY2 SEQ_I String

2010-04-10 100 200 1 HELLO HOWRU
2010-04-10 100 200 2 I AM FINE

In my destination warehouse table, I am concatenating the String from SEQ_I 1 & 2. I need to see this data as

KEY1 KEY2 String START_D END_D

100 200 HELLO HOWRU I AM FINE 2010-04-10 9999-12-31

The above scenario is a simple scenario. Now I have scenario like below and all the data whatever I am mentioning is coming in a single source file and it is a History Load of 2 months at 1 time.

SOURCE_D KEY1 KEY2 SEQ_I String

2010-04-10 100 200 1 HELLO HOWRU
2010-04-10 100 200 2 I AM FINE

2010-04-12 100 200 3 WAT ABT YOU?

2010-04-14 100 200 4 ME FINE

2010-04-16 100 200 2 I AM NOT FINE

Now, I need my output should be show in warehouse as mentioned below.

KEY1 KEY2 String START_D END_D

100 200 HELLO HOWRU I AM FINE 2010-04-10 2010-04-11

100 200 HELLO HOWRU I AM FINE WAT ABT YOU? 2010-04-12 2010-04-13

100 200 HELLO HOWRU I AM FINE WAT ABT YOU? ME FINE 2010-04-14 2010-04-15

100 200 HELLO HOWRU I AM NOT FINE WAT ABT YOU? ME FINE 2010-04-16 9999-12-31

I tried many ways by using SQL, by using stage variables, I am not succeeded. Only thing you have to remember is everything is coming in one single file and we have to generate 4 output records as mentioned above.

Please do let me know, your inputs on this problem.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It seems you me that you're needing a vertical pivot and starting a new record when the sequence number resets to 1.

This is easily accomplished with a Sort stage, a Transformer stage and a Remove Duplicates stage. The technique can be found on DSXchange; search for vertical pivot.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

As Ray Suggested you can use a sort stage and sort over the key columns and record start date then use a transformer with a pair of stage variables in such a way that you have both the current record sequence number and previous record sequence number then use two stage variable (svr1,svr2) ,

svr2:- keep the description of the latest record where sequence number was '1'.

svr1:-
1) If current record seq_number=1 then its description
2) If current record seq_number > Previous record seq_number then svr1=svr1:current_description
3) If current record seq_number < Previous record seq_number then svr1=svr2:current_description

Hope this helps
keshav0307
Premium Member
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

i think this could be designed quit easily in server job, by writing into a hashfile and do lookup the same hash file.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes it could.

Only it's a hashed file, not a hash file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

What you ideally want is an array... but not sure how you could hold such in a Parallel job... there is the vector but I haven't figured out how that could be used as part of a transform stage...

However, you could try make your own array as long as you know of a character that won't exist in your field (maybe ~ or |) or you could always filter such characters out...

At it's simplest, use a stage variable to hold the current text and one to hold the number of fields within this text.
After first row

Code: Select all

 Text: HELLO HOWRU
Count: 1
For the next row, check if your SEQ_I <= Count
If > append <delim>String
Else Replace SEQ_I numbered text field in string with new text (this will probably need to be a set of Field, Index and Substring calls)

After second row

Code: Select all

 Text: HELLO HOWRU<delim>I AM FINE
Count: 2
By the time you get to 2010-04-16 you should have

Code: Select all

Text: HELLO HOWRU<delim>I AM FINE<delim>WAT ABT YOU?<delim>ME FINE
Count: 4
So when you get to "2 I AM NOT FINE"
you do a

Code: Select all

svX: Index(Field(<Text>, <SEQ_I>)) - SEQ_I being 2 in this case
svY: Substr(1,svX)<new text>substr(Field(<Text>, SEQ_I+1)-1)
On output, you could either push out every record and then do a remove duplicates or only output at each date change replacing the <delim> with space.

An idea anyway, may have to play around a bit but hopefully the general idea is there... or find out how to use vectors... (I see the Function list provides an ElementAt but no indication how you actually populate the vector in the first place)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No need for vector/array. Just concatenate to the existing string if the indicator is greater than one, or use the new string if the indicator is 1.
The Remove Duplicates stage captures the last of each group. That, along with sorted data, is all you need.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

The way I am reading it, that is not what the OP wants. Any subsequent SEQ_I that is equal to a previous SEQ_I should replace the text of that record.

Using their example
Should a

Code: Select all

2010-04-20 100 200 1 How ya going?
appear after those already given it should result in

Code: Select all

100 200 How ya going? I AM NOT FINE WAT ABT YOU? ME FINE 2010-04-20 9999-12-31 
and not restart the sequence, nor should it be the only output.

If they wanted only one record with the most recent activity for each SEQ_I then I can see a sort/transform/dedup working but this has more requirements than what I can see such a process resolving.... but then as per the great Homer Simpson "Every time I learn something new it pushes something old out of my brain" so maybe my imagination hasn't twigged the bit that pushes that method in to place
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes, I can see that as a reading, and a variable-number-of-elements vector is probably a good solution in that case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply