sort before join


pierreroulph
Participant
Posts: 4
Joined: Mon Oct 06, 2003 2:16 am

sort before join

Post by pierreroulph »

Let's assume we have a dataset that is used several times, in N join stages (all joins use the same key, K).

Is sorting the dataset before using it in the join stages a good idea?
(The sort key would be K, of course.)

Does the join stage do the sort anyway?
Does "re-sorting" an already sorted dataset cost time? (Is the sort stable?)


PR
In ETL, life is beautiful
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

sort before join

Post by bigpoppa »

In Parallel Extender, all datasets that are inputs to a single join must be partitioned and sorted on the same keys prior to the join.

In Parallel Extender 6.0, I believe that partitioners and sorts are automatically inserted into jobs, so that a novice PX user doesn't have to understand partitioning and sorting. However, my recommendation is to turn off the automatic sort and partitioner insertion and to understand partitioning and sorting before you build a single PX job.

Re-sorting your dataset is very time- and space-consuming if you have a large volume of data. Also note: there is typically no reason to sort a non-partitioned dataset, so any sort in a PX job usually has a corresponding partitioner.
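
To make the "partitioned and sorted on the same keys" requirement concrete, here is a minimal sketch in plain Python (not DataStage code). Everything in it is an assumption made for illustration: the helper names hash_partition, sort_partitions, merge_join and parallel_join, the sample rows, and the choice of hash partitioning with an inner merge join. It only shows the idea that a dataset partitioned and sorted once on K can feed several joins on K without being sorted again.

```python
def hash_partition(rows, key, n_parts):
    """Route each row to a partition chosen from its key value (hypothetical helper)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def sort_partitions(parts, key):
    """Sort every partition on the join key, independently of the others."""
    return [sorted(p, key=lambda r: r[key]) for p in parts]


def merge_join(left, right, key):
    """Inner join of two lists that are already sorted on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row of the run with every right row of the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def parallel_join(left_parts, right_parts, key):
    """Join partition i of the left input with partition i of the right input."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(merge_join(lp, rp, key))
    return out


N_PARTS = 3  # arbitrary number of partitions for the example

# Made-up sample data: one shared dataset and two other join inputs.
shared = [{"K": k, "name": f"n{k}"} for k in (5, 1, 4, 2, 3)]
orders = [{"K": k, "amount": a} for k, a in ((3, 30), (1, 10), (1, 11), (9, 90))]
refunds = [{"K": k, "refund": r} for k, r in ((2, 7), (5, 8))]

# The shared dataset is partitioned and sorted on K exactly once ...
shared_ps = sort_partitions(hash_partition(shared, "K", N_PARTS), "K")
orders_ps = sort_partitions(hash_partition(orders, "K", N_PARTS), "K")
refunds_ps = sort_partitions(hash_partition(refunds, "K", N_PARTS), "K")

# ... and then reused as-is by both joins.
join1 = parallel_join(shared_ps, orders_ps, "K")
join2 = parallel_join(shared_ps, refunds_ps, "K")
```

Because both inputs of each join use the same partitioning function on K, matching keys always land in the same partition number, which is why the joins can run partition by partition. If the two inputs were partitioned differently, matching rows could end up in different partitions and be missed.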

- BP
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

If you are using PX, I suppose you have millions of rows. So for the sort, use a tool like Syncsort. You will gain in performance.

Pey - Hello, France