Brain Teaser...Complex Scneario

Nageshsunkoji · Post by **Nageshsunkoji** » Fri Apr 16, 2010 6:08 pm

Hi DSXians,

I came up with a Brain Teaser on Datastage. This problem is bugging us, hope the Datastage world will help me.

The Scenario is, my source data have a constrain on the Length of the String Field (10), if it crosses that limit, it is storing in a new row with the same primary keys by adding a Sequence number it. See the example below:

SOURCE_D KEY1 KEY2 SEQ_I String

2010-04-10 100 200 1 HELLO HOWRU
2010-04-10 100 200 2 I AM FINE

In my destination warehouse table, I am concatenating the String from SEQ_I 1 & 2. I need to see this data as

KEY1 KEY2 String START_D END_D

100 200 HELLO HOWRU I AM FINE 2010-04-10 9999-12-31

The above scenario is a simple scenario. Now I have scenario like below and all the data whatever I am mentioning is coming in a single source file and it is a History Load of 2 months at 1 time.

SOURCE_D KEY1 KEY2 SEQ_I String

2010-04-10 100 200 1 HELLO HOWRU
2010-04-10 100 200 2 I AM FINE

2010-04-12 100 200 3 WAT ABT YOU?

2010-04-14 100 200 4 ME FINE

2010-04-16 100 200 2 I AM NOT FINE

Now, I need my output should be show in warehouse as mentioned below.

KEY1 KEY2 String START_D END_D

100 200 HELLO HOWRU I AM FINE 2010-04-10 2010-04-11

100 200 HELLO HOWRU I AM FINE WAT ABT YOU? 2010-04-12 2010-04-13

100 200 HELLO HOWRU I AM FINE WAT ABT YOU? ME FINE 2010-04-14 2010-04-15

100 200 HELLO HOWRU I AM NOT FINE WAT ABT YOU? ME FINE 2010-04-16 9999-12-31

I tried many ways by using SQL, by using stage variables, I am not succeeded. Only thing you have to remember is everything is coming in one single file and we have to generate 4 output records as mentioned above.

Please do let me know, your inputs on this problem.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 16, 2010 7:45 pm

It seems you me that you're needing a vertical pivot and starting a new record when the sequence number resets to 1.

This is easily accomplished with a Sort stage, a Transformer stage and a Remove Duplicates stage. The technique can be found on DSXchange; search for vertical pivot.

zulfi123786 · Post by **zulfi123786** » Sat Apr 17, 2010 1:57 am

As Ray Suggested you can use a sort stage and sort over the key columns and record start date then use a transformer with a pair of stage variables in such a way that you have both the current record sequence number and previous record sequence number then use two stage variable (svr1,svr2) ,

svr2:- keep the description of the latest record where sequence number was '1'.

svr1:-
1) If current record seq_number=1 then its description
2) If current record seq_number > Previous record seq_number then svr1=svr1:current_description
3) If current record seq_number < Previous record seq_number then svr1=svr2:current_description

Hope this helps

keshav0307 · Post by **keshav0307** » Sat Apr 17, 2010 4:47 am

i think this could be designed quit easily in server job, by writing into a hashfile and do lookup the same hash file.

ray.wurlod · Post by **ray.wurlod** » Sat Apr 17, 2010 4:04 pm

Yes it could.

Only it's a hashed file, not a hash file.

Kryt0n · Post by **Kryt0n** » Sun Apr 18, 2010 8:11 pm

What you ideally want is an array... but not sure how you could hold such in a Parallel job... there is the vector but I haven't figured out how that could be used as part of a transform stage...

However, you could try make your own array as long as you know of a character that won't exist in your field (maybe ~ or |) or you could always filter such characters out...

At it's simplest, use a stage variable to hold the current text and one to hold the number of fields within this text.
After first row

Code: Select all

 Text: HELLO HOWRU
Count: 1

For the next row, check if your SEQ_I <= Count
If > append <delim>String
Else Replace SEQ_I numbered text field in string with new text (this will probably need to be a set of Field, Index and Substring calls)

After second row

Code: Select all

 Text: HELLO HOWRU<delim>I AM FINE
Count: 2

By the time you get to 2010-04-16 you should have

Code: Select all

Text: HELLO HOWRU<delim>I AM FINE<delim>WAT ABT YOU?<delim>ME FINE
Count: 4

So when you get to "2 I AM NOT FINE"
you do a

Code: Select all

svX: Index(Field(<Text>, <SEQ_I>)) - SEQ_I being 2 in this case
svY: Substr(1,svX)<new text>substr(Field(<Text>, SEQ_I+1)-1)

On output, you could either push out every record and then do a remove duplicates or only output at each date change replacing the <delim> with space.

An idea anyway, may have to play around a bit but hopefully the general idea is there... or find out how to use vectors... (I see the Function list provides an ElementAt but no indication how you actually populate the vector in the first place)

ray.wurlod · Post by **ray.wurlod** » Sun Apr 18, 2010 9:00 pm

No need for vector/array. Just concatenate to the existing string if the indicator is greater than one, or use the new string if the indicator is 1.
The Remove Duplicates stage captures the last of each group. That, along with sorted data, is all you need.

Kryt0n · Post by **Kryt0n** » Sun Apr 18, 2010 9:24 pm

The way I am reading it, that is not what the OP wants. Any subsequent SEQ_I that is equal to a previous SEQ_I should replace the text of that record.

Using their example
Should a

Code: Select all

2010-04-20 100 200 1 How ya going?

appear after those already given it should result in

Code: Select all

100 200 How ya going? I AM NOT FINE WAT ABT YOU? ME FINE 2010-04-20 9999-12-31

and not restart the sequence, nor should it be the only output.

If they wanted only one record with the most recent activity for each SEQ_I then I can see a sort/transform/dedup working but this has more requirements than what I can see such a process resolving.... but then as per the great Homer Simpson "Every time I learn something new it pushes something old out of my brain" so maybe my imagination hasn't twigged the bit that pushes that method in to place

ray.wurlod · Post by **ray.wurlod** » Sun Apr 18, 2010 9:52 pm

Yes, I can see that as a reading, and a variable-number-of-elements vector is probably a good solution in that case.