Check if multiple rows returned from Lookup Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
sensiva
Premium Member
Premium Member
Posts: 21
Joined: Tue Aug 22, 2017 10:39 am

Check if multiple rows returned from Lookup Stage

Post by sensiva »

Hello All,

Not sure if this type of requirement is common, but i felt it was a little strange.

The requirement is that, a lookup needs to be done on a table which could potentially return multiple rows. If one row is returned, then process the records, but if multiple rows are returned, then ignore the record.

I planned of using a lookup stage with multiple rows returned option, and then thought of using sort stage with key column change option. But still with that I would know, if i get multiple records only from the second record, which means I would have processed the first record by then. I am thinking of other methods, but its getting complex.

Any ideas of how else could this be achieved. I was thinking of remove duplicates as well.
sen
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

can you rig some sort of having clause into your lookup query so that if it has more than 1 it returns nothing?
myerdsdupp
Premium Member
Premium Member
Posts: 4
Joined: Thu Oct 25, 2018 8:24 pm

Post by myerdsdupp »

Use an aggregate stage after a lookup and get the count out of it. You can use filter or transformer stage to pick only records where count = 1.
sensiva
Premium Member
Premium Member
Posts: 21
Joined: Tue Aug 22, 2017 10:39 am

Post by sensiva »

Thanks for your replies

I see both similar as in the first case, we filter with oracle and the other with datastage.

I was asking this question to myself for a while now, there are operations that could be done with oracle and with datastage also, like sort, group by, distinct etc.. My collegues have always been saying to retrive all records from oracle and do all the other functionalities using datastage as its more performant and is considered as a best practise. But not sure if its true always. As our oracle is well partitioned and indexed.

As in this case, if we could implement the having clause, we would treat the same number of records as we receive from the main link, where-as in case of using the lookup stage, we would multiply the number of records to treat(if we have 1million records to treat, and if we have atleast 3 matches for each line we end up treating 2-3 million records, grouping them by storing them in memory and then filtering.

your views.

I am little inclined to using having option. Lets see how the team reacts
sen
myerdsdupp
Premium Member
Premium Member
Posts: 4
Joined: Thu Oct 25, 2018 8:24 pm

Post by myerdsdupp »

I would prefer using a having clause in oracle.

If your data is huge in the reference table then lookup is not the right option.

You can always do a trail and run with your design but try to follow the approach with avoids the bottleneck on the server.
mouthou
Participant
Posts: 208
Joined: Sun Jul 04, 2004 11:57 pm

Post by mouthou »

I think your colleagues are not entirely correct and working up entirely in Datastage is a selective scenario. Best practice is to push the operation to DB so the minimal preprocessing like yours happens while selecting the data itself.

'Multiple rows returned option" did what it is supposed to in your case. If that is not your objective, you could try co-related queries similar to your having clause option except that GroupBY is kept separate so that you can select key and all other columns in the outer query)

SELECT <MAINTAB.requiredcolumns>
FROM MAINTAB, (SELECT keycol1 from MAINTAB GROUP BY KEYCOL1 HAVING COUNT(*) = 1) MAINTABuniq
WHERE MAINTAB.keycol1 = MAINTABuniq.keycol1

OR use RANK function if you are familiar with the syntax
Post Reply