did you know? You can create a unique index in parallel

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Teej wrote:
kcbland wrote:Do a select max surrogate key on the target table prior to the job run and pass it in as a job parameter.
*bang my head on the table* Why didn't I think of THAT?

Hey wait...

How do you pass values FROM the select into a job parameter? Right now I'm thinking of this being done in a before-job routine...

-T.J.
Not a before routine, can't update job parameters once a job is running.

There's 2 really slick solutions. This would be my favorite. Write a text file listing all tables you need to go do a select max on. Write a sql script that takes as a parameter the table name. Write a anonymous block pl/sql procedure that takes a table name parameter, looks at the table ddl and gets the primary key column. Then, do a select max on that table.column. Now, write a controlling shell script that reads the text of of table lists, and in a loop peel off the table name and then run a sqlplus session with a background & giving it the anonymous block pl/sql script with the table name piping output to a log file. After all are fire off do a wait statement. Then, parse all output files for the max value output and stuff into a nice text file of table | max value.

Add to your job control the ability to read this file and feed as a parameter the appropriate max value. Or, if using server jobs, read this file into a stage variable on initialize, parsing for the required table.

Another technique is to make a centipede looking job, with a bunch of OCI stages doing select max's on their respective table, sending output to the same hash file of tablename and max value. Dump the hash file to text file if wanted, or leave as hash and read from there.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

From the Ascential announcement for 7.0.1 recently posted on ADN:
ADN wrote:A new parallel Surrogate Key Generator stage is available in DataStage Enterprise Edition. This will assist in data warehouse projects creating dimensional data.
Note that this is for 7.0.1, not 7.1 and is available now.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is this really a stage, or just one of those helper macros that create part of a job?
Forgive me for being cynical, but I can't believe there's a "one size fits all" solution for surrogate key handling.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:Is this really a stage, or just one of those helper macros that create part of a job?
No clue as we are still on 6. Perhaps some kind soul who upgrades in the near future could let us all know!
-craig

"You can never have too many knives" -- Logan Nine Fingers
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

I'm trying to get to download this today, and get it installed on a test server. Definitely will check this feature out.

Most likely it will require a parameter with the starter key value that you have to get from somewhere. Not too bad, since we already use a script to run jobs -- can always extend it to pull the current key number.

But my concern is that the 'sequence' key on Oracle may not get updated, and I still haven't test that theory with the idea presented on this thread yet (it's on the list of things to do, but delayed).

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Sohail
Participant
Posts: 10
Joined: Wed Jul 02, 2003 3:08 pm
Location: USA

Post by Sohail »

We also faced the same issue when we try to generate surrogate keys using the Oracle Sequences.

We did it two different ways, in the Server job we use OracleSeq.NextVal to get the next key, which works fine as Server execute each SQL in the database whereas the PX doesn't, it brings the same value every time from cache if the sql is same, it can be done by choosing the parse option in PX but it's very slow.

So we are using lookup stage to get the MAX Key from the table on some dummy value 'X', which just execute this SQL once and returns the MAX from the memory every time. Then we have Column Generator to generates a unique number starting with the Part and increment by Partcount, then add that generated column with the MAX value in the transformer. This logic takes care of partitioning on any number of nodes to generate unique keys across all partitions.

I'm not sure if I have explained it well, but if you have any question, please let me know.

Thanks,

Sohail.

P.S. about the surrogate keys stage, I couldn't find that in 7.0.1, anyone has any details about this? will be appreciated.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Sohail wrote:about the surrogate keys stage, I couldn't find that in 7.0.1, anyone has any details about this? will be appreciated.
It's a PX stage provided by default. I'm still waiting word from management for installing this.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Sohail
Participant
Posts: 10
Joined: Wed Jul 02, 2003 3:08 pm
Location: USA

Post by Sohail »

Teej wrote: It's a PX stage provided by default. I'm still waiting word from management for installing this.
-T.J.
I couldn't find anywhere this stage in the PX version 7, I even have asked Ascential Support and here is the reply from them

"There is a key generator routine available on the server but I am not
familar with one on the PX stage. Would you find out more information and let me know? "

Could you please let me know where exactly you have found? Or is there any special Stage need to be install separately?

I appreciate your help,

Thanks,

Sohail.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

According to the help guide for PX specific information:

Code: Select all

Behavior of Surrogate Key stage (ECASE 43240)
-------------------------------
The behavior of the Surrogate Key stage is different in a specific case on each platform.  It is different when the key reaches the maximum value for each supported data type (including unsigned). Here is a table of how it behaves:

[details of every platform's limitations]
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

kcbland wrote:Do a select max surrogate key on the target table prior to the job run and pass it in as a job parameter.
Hi All,
just to be a little 'difficult'...

I used to do this in the early 90s...the limitation is that I often wanted a 'high' value in the dimension table of 999999999. So, when I did that I changed to counting the number of rows.......however, counting the number of rows is very dangerous as I found out when the 'human error' of deleting rows happened to me. :(

So, since 1995 I have used a separate place to get integer keys from. DS now has a routine for this but before that I always put the integer key number in a table.....I also don't like numbers generated by the database, they restrict your ability to unload/delete/reload the dimension tables...

Just my opinion and how I have been doing it for 9 years now.....it works well and the only problem is the extra complexity of getting the key and making sure that the table that holds the key is backed up....
Best Regards
Peter Nolan
www.peternolan.com
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

kcbland wrote:Anyone who attaches meaning to the value or to the tight fit assignment is committing a fundamental error in the eyes of the Data Warehousing Gods, who say surrogate keys are meaningless values only there for uniqueness.
Kenneth is correct that many people say this......

However, there is a really useful place to put some meaning into a surrogate key, and that is the day/month dimensions. If you make the key sequentially increase one number for one day/month you can then easily use these keys for partitioning tables which is very important. DW tables are most frequently partitioned on a monthly basis.

This is the ONLY place where one should put some intelligence into keys. Trust me, I tried intelligent keys many times and each time came to grief!!! Especially the one of generating the key to the day dimension.....
Best Regards
Peter Nolan
www.peternolan.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

The Time dimensions are the only family of tables where I would use a smart surrogate key. Simply take the dashes out of YYYY-MM-DD and you end up with a number. Take out the dashes, spaces, and colons from YYYY-MM-DD HH24:MI:SS and you end up with a 14 digit number. This is grey area, but I would also agree it is valid (I should, because I do it).
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

kcbland wrote:The Time dimensions are the only family of tables where I would use a smart surrogate key. Simply take the dashes out of YYYY-MM-DD and you end up with a number. Take out the dashes, spaces, and colons from YYYY-MM-DD HH24:MI:SS and you end up with a 14 digit number. This is grey area, but I would also agree it is valid (I should, because I do it).
Hi Kenneth,
I did this when I was but a DW pup in 1993.....it works fine when you are using a flat model without much summarisation. Where it failed for me was when I wanted to create lots of levels of summaries across time with the minimum processing required to create the summaries......I then started using calculations on the YYYYMMDD column to give me weeks, months, quarters, years etc....and after a while I ran out of integer ranges that could be used which did not run over each other.....so since then I have used a day_dim lookup table to translate the date into a key as well as to return the keys for all the levels of time I want to summarise by.....I only use the fact that they are ascending integers to enable ease of partitioning....

Lot's of folks do the same as you do in order to not have to do the lookup to the day_dim table. And in most of these cases (that I have seen) there are few summaries or data is re-summarised rather than being incrementally updated....
Best Regards
Peter Nolan
www.peternolan.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I said I'd use a smart key for the surrogate, but that does not eliminate the need for a TIME subject area of tables. You still have to do a join to the appropriate table (D_DAY, D_HOUR, etc) to get to the attributes for aggregation/reporting purposes. I only use a smart key so that a date column is "readable". I NEVER perform on the fly math to the column, I always use the join. Sorry for the confusion.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

using any intelligence in keys

Post by peternolan9 »

kcbland wrote:I said I'd use a smart key for the surrogate, but that does not eliminate the need for a TIME subject area of tables. You still have to do a join to the appropriate table (D_DAY, D_HOUR, etc) to get to the attributes for aggregation/reporting purposes. I only use a smart key so that a date column is "readable". I NEVER perform on the fly math to the column, I always use the join. Sorry for the confusion.
Hi Kenneth,
no confusion on my part... :D I did the same.....calculated the key and then did the lookup to D_DAY (or similar).

My point, (for others) is that designs I've been building since 1993 have included incremental updates for summary levels and each of the rows for these summary levels also needs a key, so I used the YYYYMMDD key and put it thru a calc to get the week, month, quarter etc......to then also look up aggregate levels in the D_DAY table.....the problem I hit was having many summary levels and running out of guarateed unique key ranges because all keys over all levels had to be unique.....

So, in that specific case, I had a problem that was solved by not using any intelligence except the fact that the key increments by one for each day/month/quarter etc so it can be used for partitioning......when sumaries are rebuilt or multi-level tables do not exists this problem does not occur.

(Not many people know how to build DWs like this and they are not widely reported...Indeed, Metaphor (when Ralph was there) used to design stars this way though now Ralph has moved more to the view that each separate summary should be in it's own table to make like a little easier for query purposes. Ralph and I agree to disagree on that point since the way Metaphor used to build DWs implements the same logical implementation as having separate dimension/fact levels in separate tables but vastly reduces the ETL coding effort and DBA workload. I go for anything that reduces ETL effort and DBA workload because these are still major costs of any DW implementation.)

All my sampel code from that time (Cobol and C) is on my web page for anyone who wants to read it......
Best Regards
Peter Nolan
www.peternolan.com
Post Reply