Link Order in transform stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
longma98
Participant
Posts: 9
Joined: Thu Aug 14, 2003 5:12 pm

Link Order in transform stage

Post by longma98 »

I need to do an update (conditional) and insert (unconditional) in order.
I have 2 output links from a transformation stage, and specified the link order in the tranformation stage with update link before the insert link.

However, it seems the ordering is not strictly enforced in PE. Sometimes the update would occur after the insert, so the new rows would be updated as well.

I searched the other forum, and got an idea to add an extra transformation stage to artificially delay the insert (so there are 2 transformations before the insert vs. 1 transformation before update).

Now my question: will that solution/hack always guarantee an update before insert?

Other unrelated questions.

1. In terms of performance, if the transformation is simple enough that can be done using SQL case/decode construct, should it be done in SQL select, or in a transformation stage? What other considerations are there?

2. In terms of performance, if a join/lookup can be done using SQL sub-select, or something like this:

select * from A,
(select * from B where...) as tmp_table_B,
where...

Should it be done in the SQL select, or a join/lookup stage? What other considerations are there?

Thanks

LM
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Post by bigpoppa »

1. Do not rely on an extra transform to solve your update/insert timing issues. I recommend that you land your inserts to a dataset and then load that dataset into the database in a separate job. Use a sequencer to load the inserts after the updates.

If you don't mind loading the database sequentially, you could set an 'update' flag in the records, and then sequentially sort the records so that the update records move to the top of the record set and inserts move to the bottom. Then, when you load the record set sequentially, you know that your inserts will be loaded after the updates.

2. You can do simple transforms in SQL if you like. SQL code in PX is 'hidden' from the viewer, so you lose readability when you do transform logic outside of the transformer. From a performance standpoint, I don't think you'll see much of a difference. Your best bet is to try it both ways and see what is comfortable to you from the standpoints of performance and code readability.

3. The answer to the third question depends on many variables. For instance, lookups in PX take advantage of large amounts of memory. If you do not have a good amount of memory, your PX lookups and joins might not be performant. When a lookup is performed through PX on an SMP, all partitions share the same copy of the lookup table in memory. On an MPP, each partition gets a full copy of the lookup table, if you choose to use default partitioner 'entire'. You can speed up a lookup on an MPP by hashing it by the lookup keys. Either way, you want to make lookups (or lookup partitions) small enough that they fit fully in memory.
Similarly, the partitions of the right data set in a PX join on an MPP should fit in memory.

If the tables are already indexed by the join/lookup keys, you should consider running the join/lookups in the database. It might run faster than if you did the join/lookup in PX. The same rule applies for sorting data, too.

HTH,
BP
Post Reply