did you know? You can create a unique index in parallel

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

did you know? You can create a unique index in parallel

Post by bigpoppa »

Creating a unique index while running in parallel can be a challenge. The easiest way to do this is to:

Collect parallel streams into a single stream and use a midstream generator or transformer to generate the index field.

This method introduces a bottleneck in your parallel data flow because it adds a point of collection. And if you're doing downstream processing in parallel, the data will have to be rehashed.

Alternatively, you can use APT environment variables to create the index. You can use APT_PARTITION_COUNT and APT_PARTITION_NUMBER to generate the index.

APT_PARTITION_COUNT provides you with the current degree of parallelism. For instance, if your config file specifies 4 processing nodes, then APT_PARTITION_COUNT would be 4.

APT_PARTITION_NUMBER gives you the number of the current processing node. The instance of the stage you're running can be 0 through APT_PARTITION_COUNT minus one. So in the example above, APT_PARTITION_NUMBER would be between 0 and 3.

To generate the index, use the Transformer stage to initialize the index to APT_PARTITION_NUMBER. With each new row, set the index value to the previous index value plus APT_PARTITION_COUNT. This method ensures that all of the output rows will have a unique index.

The caveat is that the index will not necessarily be sequential. For instance, if you had 3 total records going into a 4-way partitioned transform, you have no way of determining which of the four partitions will receive a record and so assign an index. You could end up with any 3 of the values 0, 1, 2, 3 - meaning you could have index values of 0,1,2 or 0,1,3 or 0,2,3, etc.
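
A quick sketch of that arithmetic in Python (the row distribution below is hypothetical, and in a real job the two values come from APT_PARTITION_NUMBER and APT_PARTITION_COUNT rather than from your own code):

# Each partition seeds its index with its own partition number and steps
# by the partition count, so no two partitions can produce the same value.
def partition_keys(partition_number, partition_count, row_count):
    index = partition_number              # seed with APT_PARTITION_NUMBER
    for _ in range(row_count):
        yield index
        index += partition_count          # step by APT_PARTITION_COUNT

# 3 records into a 4-way run: whichever partitions receive a record, the
# indexes are 3 distinct values from {0, 1, 2, 3}.
rows_per_partition = [1, 1, 0, 1]         # hypothetical distribution
keys = [k for p, n in enumerate(rows_per_partition)
        for k in partition_keys(p, 4, n)]
print(sorted(keys))                       # [0, 1, 3]: unique, not sequential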
vdr123
Participant
Posts: 65
Joined: Fri Nov 14, 2003 9:23 am

Post by vdr123 »

Following your way to generate surrogate keys...
I guess you have assumed the number of records in each partition to be EQUAL, which is NOT the case most
of the time when we sort & partition (using different partitioning methods) the records.
If we don't get an EQUAL number of records in each partition, we will be left with gaps in between.

Eg: if the 4 partitions receive 1, 2, 3 and 4 input records respectively, the keys assigned are 0, 1, 2, 3, 5, 6, 7, 10, 11, 15, so index #4 (among others) is never assigned to a data record; see the sketch below.
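
To make the gaps concrete, here is the same stepping rule as a quick Python check (using the 1, 2, 3 and 4 row counts from the example above):

# 4 partitions receiving 1, 2, 3 and 4 rows respectively.
partition_count = 4
rows_per_partition = [1, 2, 3, 4]
keys = sorted(p + i * partition_count
              for p, n in enumerate(rows_per_partition)
              for i in range(n))
print(keys)                                        # [0, 1, 2, 3, 5, 6, 7, 10, 11, 15]
print(sorted(set(range(max(keys))) - set(keys)))   # gaps: [4, 8, 9, 12, 13, 14]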

- Is it OK to generate surrogate keys which are not in order... does it create any problems if values are skipped (well, it's only a MAX of 4 numbers you will be missing in the worst case)?

PLEASE VERIFY YOUR ASSUMPTIONS...
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Gaps in surrogate key assignments are okay. Anyone who attaches meaning to the value or to the tight fit assignment is committing a fundamental error in the eyes of the Data Warehousing Gods, who say surrogate keys are meaningless values only there for uniqueness.

From the technique supplied, there's no issue that there are gaps. The skewing effect is based on the method of partitioning the data. Round-robin is the only way to guarantee an absolutely even distribution of data, with the exception of the last few rows if your total row count is not evenly divisible by your partition count. Hash partitioning has the ability to skew heavily, but that is still okay. If you have 4 partitions and all rows hash down to the same partition, the worst thing that will happen is that surrogate keys for that run are assigned in steps of 4. Whoop-dee-doo. Make sure your surrogate primary key is sufficiently large and there's no worry about gaps.

The technique is sound. Your only issue is the initial seed value if building on an existing populated table. Every partition will have to start at that seed value and assign from there.
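
As a sketch (the function and names are hypothetical, not DataStage calls), seeding just shifts the same formula: each partition starts one above the seed plus its partition number, and steps by the partition count as before.

# Key for the n-th row (0-based) seen by one partition, given the max
# key ("seed") already present in the target table.
def seeded_key(seed, partition_number, partition_count, row_index):
    return seed + 1 + partition_number + row_index * partition_count

# With an existing max of 1000 on a 4-way run:
assert seeded_key(1000, 0, 4, 0) == 1001   # partition 0, first row
assert seeded_key(1000, 3, 4, 0) == 1004   # partition 3, first row
assert seeded_key(1000, 1, 4, 2) == 1010   # partition 1, third row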
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

did you know? You can create a unique index in parallel

Post by bigpoppa »

You are guaranteed uniqueness if you use the technique above for generating a surrogate key. You are not guaranteed that there will be no gaps, which is not a problem - as Ken points out - if you don't assign other meaning to the surrogate key besides uniqueness.

- BP
vdr123
Participant
Posts: 65
Joined: Fri Nov 14, 2003 9:23 am

Post by vdr123 »

Thank you for confirming that gaps in the keys don't matter if they're used only for uniqueness...
Also, if I have to store the last key (index) generated for a run, how would I do it? Meaning, will I be able to figure out the last value generated by the key generation, so that I can start my next run from the last MAX key and proceed the same way to create keys for the daily loads?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Do a select max on the surrogate key of the target table prior to the job run and pass it in as a job parameter.
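
A minimal sketch of that pattern as a controlling script; the table, column, credentials, project and job names are invented here, and it assumes an Oracle target via cx_Oracle plus the dsjob command line:

import subprocess
import cx_Oracle   # any DB-API driver would do; cx_Oracle is an assumption

conn = cx_Oracle.connect("user/password@warehouse")    # hypothetical DSN
cur = conn.cursor()
# NVL so an empty target table seeds the keys at 0.
cur.execute("SELECT NVL(MAX(surrogate_key), 0) FROM target_table")
max_key = int(cur.fetchone()[0])
conn.close()

# Hand the value to the job as a job parameter.
subprocess.run(
    ["dsjob", "-run", "-param", f"MAX_KEY={max_key}",
     "MyProject", "MyLoadJob"],                        # hypothetical names
    check=True)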
Kenneth Bland

vdr123
Participant
Posts: 65
Joined: Fri Nov 14, 2003 9:23 am

Post by vdr123 »

Well sure... is there a way I can figure it out and store the value in a file at the end of the day's processing, so that I don't have to use DB queries to select the MAX? I guess the key-management function in DS uses this kind of concept...
I can read the MAX value previously written to the file and retrieve it before running the next day's run.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I wouldn't design it as you suggest.

1. What if the file is missing or corrupted?
2. What if the file is updated, but for whatever reason the data is invalidated and removed after loading? Now you have to reset this value for all affected tables.
3. Where are you going to store it that's secured? I've had (over my dead body) to do solutions where the max was stored in a table post-run. Not my choice.

What's the concern about a select max? If your target is not partitioned, then the primary key is indexed, and a select max should take just a scan of the index. If the target is partitioned with locally managed indexes, then it's a parallel query of the local indexes. It's just about the same to have 32 processes each scanning 1/32nd of the data for a max and then maxing the maxes from the 32, as it is to do a parallel index scan on a partitioned index, maxing the maxes.

In fact, you could do just-in-time delivery of the select max, kicking it off so the result is ready by the time the job stream has stepped to the point of needing it. Think of it as throwing a ball into the distance and then running down to catch it.
Kenneth Bland

vdr123
Participant
Posts: 65
Joined: Fri Nov 14, 2003 9:23 am

Post by vdr123 »

I have not decided what to use for surrogate keys... just exploring different feasible options that I can finalize later... There are always those issues when we deal with storing things in a file... But I guess the DS key-management process uses hashed files to store the value to be retrieved for the next run.
I will be using PX for generating keys (using the APT_ variables...)... hope it will not create any problems, as discussed before...
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

vdr123 wrote:But i guess the DS process of key-management is using the hash-files to store the value to be retrieved for the next run.
Select max on the table should be the preferred choice. The exposure is least, restart is no issue, and it always works unless the instance is offline, at which point you wouldn't be doing anything anyway. This is one of my best practices.
vdr123 wrote: I will be using PX for generating keys(using APT_ variables...)...hope it will not create any problems as discussed before...
You'll still need the seeding value...
Kenneth Bland

vdr123
Participant
Posts: 65
Joined: Fri Nov 14, 2003 9:23 am

Post by vdr123 »

What do you mean by SEEDING value?
I will be needing:
- MAX value from the table
- APT_PARTITION_COUNT
- APT_PARTITION_NUMBER
- incremental number (1 here)
Hope the APT_* variables are declared in the environment and can be read as environment variables... I read somewhere in the Ascential docs that we cannot use $ENV in parallel mode???
But I will also be using server mode to run some of my jobs (like calling QS jobs)... so during that time I can get APT_PARTITION_COUNT... not sure how to get APT_PARTITION_NUMBER (as it's only available when running on that particular partition node).

Please throw some light on this!
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Seed value: the max value currently assigned, which "seeds" the next run of processing with its starting point for assignment.
Kenneth Bland

Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: did you know? You can create a unique index in parallel

Post by Teej »

bigpoppa wrote:Alternatively, you can use APT environment variables to create the index. You can use APT_PARTITION_COUNT and APT_PARTITION_NUMBER to generate the index.
Hmm...

I wonder how this would hold up against an Oracle table with a surrogate key that has a sequence generator on board. Does this generator automatically update itself when you create a new record with an explicit number, or do you have to explicitly tell it to get its number up past the point of safety?

The issue here at this location is that DataStage is not the only thing handling this particular table; other programs written in other languages are also accessing it. They would be using the .NEXTVAL command to get the 'guaranteed' free key value.

I will have to experiment with this at work. If I could get this working well, I can easily eliminate a custom stage (that directly pulls from Oracle to generate a key), and add some code to my front end program to get the base value to start with for the key.
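
From what I understand, an Oracle sequence does not track keys assigned outside .NEXTVAL, so after an external load it has to be pushed past the table's max explicitly. A sketch with hypothetical sequence and table names:

import cx_Oracle   # assumption: Oracle access via cx_Oracle

conn = cx_Oracle.connect("user/password@warehouse")    # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT NVL(MAX(surrogate_key), 0) FROM target_table")
max_key = int(cur.fetchone()[0])
cur.execute("SELECT key_seq.NEXTVAL FROM dual")
current = int(cur.fetchone()[0])
if current < max_key:
    # Widen the increment once, consume one value, then restore it.
    cur.execute(f"ALTER SEQUENCE key_seq INCREMENT BY {max_key - current}")
    cur.execute("SELECT key_seq.NEXTVAL FROM dual")
    cur.execute("ALTER SEQUENCE key_seq INCREMENT BY 1")
conn.close()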

As it is with our traditional development, we do not allow multiple access to the same table by different programs, so crossovers are not a major concern.

Thanks for the excellent tip!

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

kcbland wrote:Do a select max surrogate key on the target table prior to the job run and pass it in as a job parameter.
*bang my head on the table* Why didn't I think of THAT?

Hey wait...

How do you pass values FROM the select into a job parameter? Right now I'm thinking of this being done in a before-job routine...

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Can't do it from a before-job routine, because this runs as part of the job itself, and parameter values can't be changed once a job is running.
Can do it from a controlling job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.