Group by clause implementation in Datastage

rumu · Post by **rumu** » Fri Feb 03, 2017 3:15 am

Dear Members,

I have a requirement as below:

Select first row
From TEXT
Where TEXT.OID = 2498
And TEXT.TEXT in ('Discharge', 'Expired')
Group by TEXT.OID
Order by TEXT.STARTDTIME Asc

The source TEXT is Flat file.There are mutiple STARTDTIME for same OID and we have to pick up the least one .
We designed as below:
SEQFile Stage---->RDC-----Xfrm-----Target dataset

In RDC,we used OID as key and Partitioning tab
did HashPartitioning on OID and Sort on STARTDTIME.Not used Sort on OID.

We received proper output.Wanted to check with you if this approach is fine or we need to sort OID also along with Partitioning.

Thanks,
Rumu

chulett · Post by **chulett** » Fri Feb 03, 2017 8:23 am

So... RDC is the "Remove Duplicates" stage, yes?

Yes, one would need to sort on all of the "group by" fields, in this case OID. I would wager if you did more extensive testing on larger data sets you would no longer receive 'proper output'.

rumu · Post by **rumu** » Fri Feb 03, 2017 9:03 am

Thanks Chulett,noted this point.

rameshrr3 · Post by **rameshrr3** » Wed Feb 08, 2017 6:52 pm

Only one group - since TEXT.OID is filtered on single value ? Whats the point - curious

chulett · Post by **chulett** » Thu Feb 09, 2017 8:13 am

What's the point of what?

rameshrr3 · Post by **rameshrr3** » Thu Feb 09, 2017 11:41 am

From TEXT
Where TEXT.OID = 2498
And TEXT.TEXT in ('Discharge', 'Expired')
Group by TEXT.OID

If they were using SQL , a group by would not be needed

chulett · Post by **chulett** » Thu Feb 09, 2017 12:16 pm

When people post examples they rarely fully match reality (if at all) so I'll typically answer with general / generic advice. Did much the same here so they can apply it to whatever their specific situation is, as needed.