Compare input column with Loop cache value in transformer

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
trenicar
Premium Member
Posts: 51
Joined: Thu Sep 18, 2003 4:38 am

Compare input column with Loop cache value in transformer

Post by trenicar »

Hi
I want to use looping in a Transformer stage to aggregate data and store a total on each input record.

But I need to compare a date held in the input record with one in each of the cached records I am looping through, to calculate the sum. Under certain conditions a record from the cache will be ignored.

Would storing the input value in a stage variable be a way round it, or is there a simpler solution?
Thanks
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Using a stage variable should work.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
trenicar
Premium Member
Posts: 51
Joined: Thu Sep 18, 2003 4:38 am

Post by trenicar »

Thanks Andy

But is there some syntax that distinguishes the input stream from the cache?
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

I'm not certain what you mean. DataStage doesn't really have a concept of "cached" records. Can you explain what you mean by cached records and how you plan to accomplish it?

I had assumed you meant that you were looping through an array of numbers contained in a single record. In that case they would be stored in a loop variable.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

asorrell wrote:DataStage doesn't really have a concept of "cached" records.
I'm assuming they are referring to the transformer cache introduced in 8.5 along with looping and the SaveInputRecord() and GetInputRecord() functions. Yes?
-craig

"You can never have too many knives" -- Logan Nine Fingers
trenicar
Premium Member
Posts: 51
Joined: Thu Sep 18, 2003 4:38 am

Post by trenicar »

Yes that is correct
trenicar
Premium Member
Posts: 51
Joined: Thu Sep 18, 2003 4:38 am

Post by trenicar »

Just to give a bit more background

I have the following data

Code: Select all

key1  key2 date    value
123   111  1/1/12  500
123   113  1/2/12  300
123   115  1/3/12  600
123   117  1/4/12  100

125   121  1/1/12  500
125   133  1/2/12  300

126   165  1/3/12  600
126   167  1/4/12  100
The data is sorted by key1, date.

Key1 identifies a set of records; each key2 record in that set must be processed in the loop.
For each key2 record I must sum the value, but only if the date of the loop record is greater than the date of the key2 record being processed.

The output should be something like this

Code: Select all

key1  key2 date    value  Sum
123   111  1/1/12  500    1000
123   113  1/2/12  300    700
123   115  1/3/12  600    100
123   117  1/4/12  100    0

125   121  1/1/12  500    300
125   133  1/2/12  300    0

126   165  1/3/12  600    100
126   167  1/4/12  100    0
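To spell out the rule in the example above, here is a minimal Python sketch (not DataStage code; the function name and list-of-dicts layout are my own illustration): for each record, Sum is the total of value over the other records in the same key1 group that have a later date.

```python
from datetime import date

# Hypothetical illustration of the required "Sum" column: for each record,
# add up `value` over records in the same key1 group with a later date.
def add_sums(records):
    out = []
    for r in records:
        total = sum(o["value"] for o in records
                    if o["key1"] == r["key1"] and o["date"] > r["date"])
        out.append({**r, "sum": total})
    return out

rows = [
    {"key1": 123, "key2": 111, "date": date(2012, 1, 1), "value": 500},
    {"key1": 123, "key2": 113, "date": date(2012, 1, 2), "value": 300},
    {"key1": 123, "key2": 115, "date": date(2012, 1, 3), "value": 600},
    {"key1": 123, "key2": 117, "date": date(2012, 1, 4), "value": 100},
]

print([r["sum"] for r in add_sums(rows)])  # [1000, 700, 100, 0]
```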
To my mind this is exactly what the transformer looping process was made for.

Another alternative was to produce a cartesian join of the data on itself,
filter to the records that were to be summed and aggregate, then use the result as a lookup to update the records.

This worked fine in the test version; however, there are 8 million records in production and it ran out of temp DB space.

That will again be an issue with the looping solution, as it will be doing a lot of processing.

So I have another question. I presume that SaveInputRecord() saves the whole incoming record. In some cases these key1 sets
can run to 2,000 records, so it would presumably make sense to reduce the incoming data to just the columns needed for the processing,
rather than carry the whole record (with data that is only used further down the process) and then merge the results back in again.

Just so I can understand how looping in a transformer works: is there any document that describes exactly this process?
There are a number of examples around, but none seem to explain the processing logic.

Sorry for the long post, but I thought I had better explain exactly what I am trying to do.

[Note - added code blocks to retain formatting in example - Andy]
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Craig - Thanks for reminding me about the Transformer cache, nice link also, very helpful.

Per request - some looping examples:

http://newtonapples.com/datastage-trans ... looping-1/

http://newtonapples.com/datastage-trans ... looping-2/

Several examples in IBM doc:
http://pic.dhe.ibm.com/infocenter/iisin ... oop_1.html
Last edited by asorrell on Sat Dec 07, 2013 10:33 am, edited 1 time in total.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Ok - after sitting here looking at your example for a while, I finally figured out what it was doing. I don't think you need looping, caching, or any of that.

Sort data by key1, date descending. Hash on key1.

Create these stage variables in this order:
  • svSum initialized to 0
  • svRunningTotal initialized to 0
  • svKey1 initialized to ""
  • svCurrentDate initialized to ""
Derivations as follows:

Code: Select all

svSum:          If inputlink.key1 <> svKey1 Then 0 Else 
                If inputlink.key1 = svKey1 and inputlink.date <> svCurrentDate Then svRunningTotal Else svSum

svRunningTotal: If inputlink.key1 <> svKey1 Then inputlink.value Else inputlink.value + svRunningTotal

svKey1:         inputlink.key1

svCurrentDate:  inputlink.date
Then just output svSum as your "Sum" value in your output record.

To summarize:
1) Reset running total / output sum when key1 changes
2) Only update the sum to be output with a new value from running total when the date changes.

I believe that will work and will be much faster. It also requires no caching or special buffering so it should be able to process 8 million records no problem. Let me know if that works, it was sort of fun to puzzle out.
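For anyone who wants to check the logic, here is the same stage-variable scheme sketched in Python (my own rendering, not DataStage syntax; it assumes the input is sorted by key1 ascending, date descending, and the variables are updated in the same order as the derivations above):

```python
from datetime import date

# Python rendering of the stage-variable derivations, evaluated per record
# in the same order: svSum, svRunningTotal, svKey1, svCurrentDate.
def transform(rows):  # rows must be sorted by key1 asc, date desc
    sv_sum, sv_running, sv_key1, sv_date = 0, 0, None, None
    out = []
    for r in rows:
        # svSum: reset on key1 change; pick up running total on date change
        if r["key1"] != sv_key1:
            sv_sum = 0
        elif r["date"] != sv_date:
            sv_sum = sv_running
        # svRunningTotal: restart on key1 change, otherwise accumulate
        if r["key1"] != sv_key1:
            sv_running = r["value"]
        else:
            sv_running = r["value"] + sv_running
        # svKey1 / svCurrentDate: remember this record's keys
        sv_key1, sv_date = r["key1"], r["date"]
        out.append({**r, "sum": sv_sum})
    return out

rows = [  # key1=123 group from the example, date descending
    {"key1": 123, "key2": 117, "date": date(2012, 1, 4), "value": 100},
    {"key1": 123, "key2": 115, "date": date(2012, 1, 3), "value": 600},
    {"key1": 123, "key2": 113, "date": date(2012, 1, 2), "value": 300},
    {"key1": 123, "key2": 111, "date": date(2012, 1, 1), "value": 500},
]
print([r["sum"] for r in transform(rows)])  # [0, 100, 700, 1000]
```

Note the output records come out in date-descending order; they match the expected Sum values from the earlier post, just reversed.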

Note: If this works, please mark topic solved as a "work-around", since it technically doesn't use looping to fix the issue...
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020