sort before join

pierreroulph · Post by **pierreroulph** » Mon Oct 06, 2003 2:48 am

Let's assume we have a dataset that is used several times, in N join stages (same key K for all joins).

Is sorting the dataset before using it in the join stages a good idea?
(The sort key should be K, of course.)

Does the join stage do the sort anyway?
Does "re-sorting" an already sorted dataset cost time? (Is it a stable sort?)

PR

In ETL, life is beautiful

bigpoppa · Post by **bigpoppa** » Mon Oct 06, 2003 12:24 pm

In Parallel Extender, all datasets that are inputs to a single join must be partitioned and sorted on the same keys prior to the join.
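To illustrate why sorted inputs matter, here is a minimal Python sketch of a generic sort-merge join. It is not how PX implements its join stage; the function name `merge_join` and the row layout (tuples with the key in position 0) are assumptions made for the example. The point is only that, once both inputs are sorted on K, the join can walk them in a single pass.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left, right, key=itemgetter(0)):
    """Inner join of two iterables that are ALREADY sorted on `key`.

    Hypothetical helper for illustration only; it does not sort its inputs.
    """
    # Group consecutive rows sharing a key; this only works on sorted input.
    left_groups = ((k, list(g)) for k, g in groupby(left, key))
    right_groups = ((k, list(g)) for k, g in groupby(right, key))

    l = next(left_groups, None)
    r = next(right_groups, None)
    out = []
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_groups, None)   # left key has no match, advance left
        elif l[0] > r[0]:
            r = next(right_groups, None)  # right key has no match, advance right
        else:
            # Keys match: emit every pairing of the two groups.
            out.extend((a, b) for a in l[1] for b in r[1])
            l = next(left_groups, None)
            r = next(right_groups, None)
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
joined = merge_join(left, right)
```

If either input were not sorted on the key, this single-pass walk would silently miss matches, which is why the sort has to happen somewhere before the join.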

In Parallel Extender 6.0, I believe that partitioners and sorts are automatically inserted into jobs, so that a novice PX user doesn't have to understand partitioning and sorting. However, my recommendation is to turn off the automatic sort and partitioner insertion and to understand partitioning and sorting before you build a single PX job.

Re-sorting your dataset is a very time- and space-consuming activity if you have a large volume of data. Also note: typically there is no reason to do a sort on a non-partitioned dataset, so usually any sort in a PX job has a corresponding partitioner.
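As a rough picture of what "partition, then sort" means, here is a small Python sketch. It only illustrates the general idea, not PX internals; `partition_and_sort`, the hash-modulo partitioning scheme, and the tuple row layout are all assumptions for the example.

```python
from operator import itemgetter

def partition_and_sort(rows, num_partitions, key=itemgetter(0)):
    """Hash-partition rows on `key`, then sort each partition on `key`.

    Hypothetical helper: rows with the same key always land in the same
    partition, so each partition can be processed independently.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    for part in partitions:
        # Python's list.sort is stable: rows with equal keys keep their
        # original relative order.
        part.sort(key=key)
    return partitions

rows = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (4, "d")]
parts = partition_and_sort(rows, 2)
```

If the same dataset feeds several joins on the same key, doing this once and reusing the result avoids paying for the sort again in each join.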

- BP

Peytot · Post by **Peytot** » Thu Oct 23, 2003 6:38 am

If you use PX, I suppose you have millions of rows or more. So for the sort, use a tool like Syncsort. You will gain in performance.

Pey - Hello, France