Duplicate entry issue - Lookup stage

TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

No replies then?
I have a workaround for this, but I'd still like my doubts cleared up, in particular whether my assumption about why the duplicate rows are returned is correct.
Tony
BI Consultant - Datastage
Klaus Schaefer
Participant
Posts: 94
Joined: Wed May 08, 2002 8:44 am
Location: Germany

Post by Klaus Schaefer »

Tony,

"The first solution I tried was allowing multiple lines as output - which is the universal solution that I seem to come across everywhere, including on some of the threads here - however I cannot use that because as we know, activating that option throws out more lines than in my source thus making it essential for me to use a remove duplicates at the end again."

If this is the case - you are getting more output rows than you have source rows - it indicates that your lookup key is NOT unique in your reference lookup data.
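
One quick way to sanity-check that outside DataStage is to count the key values yourself. A minimal sketch - it assumes the reference data has been exported to CSV, and ref_data.csv / cust_id are only placeholders for your actual file and lookup key column:

Code: Select all

#!/usr/bin/env python3
# Minimal sketch: count how often each lookup key value appears in the
# reference data. Any count > 1 explains the extra output rows.
import csv
from collections import Counter

REF_FILE = "ref_data.csv"   # placeholder - your exported reference data
KEY_COL = "cust_id"         # placeholder - your lookup key column

with open(REF_FILE, newline="") as f:
    counts = Counter(row[KEY_COL] for row in csv.DictReader(f))

dupes = {k: n for k, n in counts.items() if n > 1}
if dupes:
    print(f"{len(dupes)} key value(s) occur more than once:")
    for k, n in sorted(dupes.items()):
        print(f"  {k}: {n} rows")
else:
    print("All lookup key values are unique in the reference data.")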

For the Lookup stage, "Auto" or "Entire" partitioning means the reference data is read into a shared memory pool at the start. Each partition then does its lookups against this shared memory pool. The data is not copied 8 times, once into each partition...

Best regards
Klaus

p.s.: You might check your partitioning for the job design in general.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Coming back to this and specifically to Klaus's post:

Klaus...

What you said is how I had understood the Lookup stage to function - Entire partitioning should NOT copy each row into each partition. Curiously enough, however, in my setup that's exactly what is happening.

I tested with a reference containing a single matching row and verified that its lookup key exists exactly once - so no chance of any duplicates there.

Unfortunately, if I don't activate the "allow multiple rows" option I get a warning about duplicate rows, and if I do activate it, I get 8 lookup results - my configuration file defines 8 nodes.

So the solution I've had to adopt is hash partitioning on both links (stream and reference) on the same field.

That's a workaround and seems to work.

It still goes against the theory that "Entire partitioning means the data is read in the beginning into a shared memory pool. Each partition then is doing a lookup into this shared memory pool."

Maybe it's my server configuration (SMP/MPP), which is something I don't know how to ascertain.
Tony
BI Consultant - Datastage
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

TonyInFrance wrote: Maybe it's my server configuration (SMP/MPP), which is something I don't know how to ascertain.
YES! That architecture is exactly what controls whether you get everything up in shared memory or everything spread out, duplicated on each node.

viewtopic.php?t=96993
-craig

"You can never have too many knives" -- Logan Nine Fingers
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Thanks Craig. Just read through that. I'm guessing my server is an MPP architecture, which I gather isn't that expensive after all, since two SMP machines together can become an MPP server.

That said, is there any specific command to find out what the architecture is?

At least I'm sure now that I wasn't completely loco in my observations - my single row is indeed getting duplicated due to the MPP server architecture, thus throwing a duplicate entry warning even though there are actually no duplicates in the reference dataset.

If I use Entire partitioning on the reference link, then the only way to eradicate that would be to activate the multiple rows output option and use a Remove Duplicates stage.

Else I guess I should use hash partitioning on both stream and reference?

Or stop using Lookup and use a Join?
Tony
BI Consultant - Datastage
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Just came across this:

http://datastagedoctrina.blogspot.fr/20 ... re-is.html

If this is true then I'm on an SMP... and that means my theory in my previous post is wrong.

I thus have duplicate entries returned despite being on an SMP configuration - that is, shared memory accessed by all 8 nodes.
Tony
BI Consultant - Datastage
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The easiest way to tell whether you're on SMP or MPP is to look at the configuration file. If the fastname is the same for every node, then you have an SMP environment. If not, then you have an MPP environment.
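
A minimal sketch of that check, assuming the configuration file follows the usual node { fastname "host" ... } layout; the path below is only a placeholder for your own $APT_CONFIG_FILE:

Code: Select all

#!/usr/bin/env python3
# Minimal sketch: read an APT configuration file and report whether all
# node fastnames are identical (SMP) or differ (MPP/cluster).
import re

CONFIG_FILE = "/opt/IBM/InformationServer/Server/Configurations/default.apt"  # placeholder

with open(CONFIG_FILE) as f:
    fastnames = re.findall(r'fastname\s+"([^"]+)"', f.read())

unique = sorted(set(fastnames))
print(f"{len(fastnames)} node(s) found; fastname(s): {unique}")
print("Looks like SMP (all fastnames identical)" if len(unique) <= 1
      else "Looks like MPP/cluster (multiple fastnames)")
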
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Exactly what I did and found out that I'm on an SMP setup.

That said, I don't understand why the tenets of SMP aren't being followed here - despite the shared memory pool, my reference file, consisting of one and only one row and partitioned using the 'Entire' algorithm, gets copied 8 times to my destination through the Lookup stage - and my configuration file has 8 nodes with identical fastnames.
Tony
BI Consultant - Datastage
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

Tony, I don't see an issue here. When you use "Entire", the row is copied to all 8 nodes and you get the warning. When you use "Hash", you don't get it because each key value goes to only one node. This is how it is supposed to work.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It should not happen, unless you Entire partition the stream input as well. Otherwise any value will be looked up only on the node on which it occurs; the Entire partitioning of the reference data set ensures that it will not be missed due to incorrect partitioning.

Otherwise you can partition the reference input identically to how the stream input is partitioned. Not really relevant in an SMP environment, where there is only one reference data set, in shared memory.

I suspect that there are smart ways to partition when using range lookups, but that's beyond the scope of the current thread.
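
To illustrate the point with a toy model (purely a sketch of the idea, not how the engine is implemented): with the stream hash-partitioned, each stream row lands on exactly one node and probes only that node's copy of the Entire-partitioned reference, so one match comes back; if the stream were Entire-partitioned as well, every node would see the row and you would get one match per node.

Code: Select all

#!/usr/bin/env python3
# Toy model only - not the real engine. Shows why a hash-partitioned stream
# probing an Entire-partitioned reference returns one match per stream row,
# while an Entire-partitioned stream returns one match per node.
NODES = 8
reference = {"K1": "ref-row"}   # one reference row, copied to every node

def matches_for_stream_row(stream_partitioning):
    matches = 0
    for node in range(NODES):
        if stream_partitioning == "hash" and node != hash("K1") % NODES:
            continue            # hash: the stream row exists on one node only
        if "K1" in reference:   # each node probes its own full reference copy
            matches += 1
    return matches

print("hash-partitioned stream:  ", matches_for_stream_row("hash"), "match(es)")
print("entire-partitioned stream:", matches_for_stream_row("entire"), "match(es)")
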
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

I agree with the theory in principle here, Ray.
Unfortunately that's not what's happening in my case.

I'm on an SMP environment.
I've got my stream data partitioned as hash.
My reference data is entire partitioned.
The config file I'm using defines 8 nodes with the same fastname (thus proving I'm on SMP).

With this setup, each row is returned 8 times (when I select the multiple rows option in my Lookup stage), and unless I explicitly use a Remove Duplicates stage, I get redundant data.

If I don't use the multiple rows option, I get a warning that duplicate rows exist, which is also something I don't want.

The workaround I've thus opted for is using a Join instead of a Lookup, hash partitioning both stream and reference links as is the norm for the Join stage.
Tony
BI Consultant - Datastage
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

TonyInFrance wrote: If I don't use the multiple rows option, I get a warning that duplicate rows exist, which is also something I don't want.
I've never seen a warning like this in a case where there wasn't actually a duplicate row based on the defined lookup key(s).

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The duplicate rows warning should be per-node.

Have you examined the score to determine what kind of partitioning is ACTUALLY occurring?
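
(Assuming a standard PX installation, the score can be dumped to the job log by adding the $APT_DUMP_SCORE environment variable to the job and setting it to True, for example in the job's parameter grid:)

Code: Select all

$APT_DUMP_SCORE = True

The dumped score shows which operators were generated and how the data was actually partitioned across the nodes.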
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.