dimensions that change over time

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

DataStage does not have any data warehousing functionality built in; it is a straight ETL tool for all kinds of data integration and does not specialise. There is a DataStage Best Practices training course which contains some very good ideas for setting up data warehouse loads, which may be what you are looking for.

I think the main DataStage feature that will help you maintain your warehouse dimensions is the hash file lookup. This lets you match legacy data primary keys to the unique key of your warehouse dimension, giving you a fast and easy way to update those dimensions.
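In rough terms, the lookup behaves like a key-to-key map. Here's a minimal Python sketch of the idea (not DataStage syntax, and the keys and values are made up for illustration):

[code]
from itertools import count

# The hash file, in spirit: legacy (natural) key -> warehouse surrogate key.
# Keys and starting values here are invented for illustration.
dim_lookup = {"C001": 1001, "C002": 1002}
next_key = count(1003)

def resolve_surrogate(natural_key):
    """Return the dimension's surrogate key for a legacy primary key,
    allocating a new one for members never seen before."""
    if natural_key not in dim_lookup:
        dim_lookup[natural_key] = next(next_key)
    return dim_lookup[natural_key]

print(resolve_surrogate("C001"))  # existing member -> 1001
print(resolve_surrogate("C999"))  # new member      -> newly allocated key
[/code]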

Vincent McBurney
Data Integration Services
www.intramatix.com
mihai
Participant
Posts: 30
Joined: Thu Mar 13, 2003 5:24 am
Location: Hertfordshire

Post by mihai »

Hi

As Vincent said, there are no specific features in DataStage that will help you build and manage slowly changing dimensions.
On the surface, this may sound like a bad thing (i.e. no out-of-the-box SCD support, as ActaWorks/Informatica provide). In practice, it's actually a good thing. In my limited experience of automated SCD builds, flexibility turns out to matter more than having something ready-made, since a job suite developed for the task at hand will invariably be more 'tunable' than a generic solution.

Good money is to be made by DS consultants developing SCD jobs, so you may find it easy to get advice, but difficult to get deliverables [:D]


For what it's worth, here's how Type 2 SCDs were implemented for a client in the past.

a. Identified the source system primary key for the dimension and stored that information in a metadata repository somewhere.
b. Identified the 'interesting' columns (i.e. columns where a change in value means a new record is to be created) and stored that information in the same metadata repository.
c. Marked the remaining columns as 'not interesting' (i.e. a change only updates the latest version of the record, no new record is inserted) and recorded that as well.
d. Identified the validity period for the new records.
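To make that concrete, here is roughly what one entry in such a metadata repository might look like, written as a Python structure rather than the tables we actually used; every table and column name below is invented for illustration:

[code]
# Hypothetical metadata driving a generic Type 2 SCD job.
# All table and column names are invented.
SCD_METADATA = {
    "DIM_CUSTOMER": {
        # (a) source system primary key components
        "pk_columns": ["CUST_NO", "SOURCE_SYSTEM"],
        # (b) 'interesting' columns: a change here creates a new dimension record
        "interesting_columns": ["ADDRESS", "SEGMENT"],
        # (c) 'not interesting' columns: a change only updates the latest record
        "other_columns": ["PHONE", "CONTACT_NAME"],
        # (d) validity period columns stamped on each record
        "valid_from_column": "EFFECTIVE_DATE",
        "valid_to_column": "EXPIRY_DATE",
    },
}
[/code]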


As a side-note, try to get only changed data from the source system. Granted, this opens up the CDC discussion, which deserves a whole post to itself.


The SCD job took the table name as a parameter to make it truly generic. The initial input stage read a single-field record, and the metadata repository mentioned above defined how the record was to be handled.
The metadata storage was interrogated to identify the PK components, which were concatenated into a single field.
The same was done for the 'interesting' columns and the 'non-interesting' columns.
We ended up with a 'record' composed of the source system (composite) primary key in field1, the interesting columns in field2, the other columns in field3 and the validity period start and end in fields 4 and 5.

The same kind of transformations was applied to the data already in the warehouse, so it was possible to compare what had already been loaded against what had just arrived using the same DataStage job. Constraints were used to control the data flow through the job and to generate insert/modify/delete/no-action statements as appropriate.
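If it helps, here's a rough Python sketch (not DataStage BASIC; it assumes the hypothetical metadata layout above) of the compare-and-route decision those constraints were making for each incoming row:

[code]
def build_fields(row, meta):
    """Collapse a row into the composite fields described above:
    field1 = concatenated PK, field2 = interesting columns, field3 = other columns."""
    field1 = "|".join(str(row[c]) for c in meta["pk_columns"])
    field2 = "|".join(str(row[c]) for c in meta["interesting_columns"])
    field3 = "|".join(str(row[c]) for c in meta["other_columns"])
    return field1, field2, field3

def scd2_action(incoming, warehouse, meta):
    """Decide how one incoming row should be routed, given the current
    warehouse rows keyed by their concatenated PK (field1)."""
    pk, interesting, other = build_fields(incoming, meta)
    current = warehouse.get(pk)
    if current is None:
        return "insert"                # brand-new dimension member
    _, cur_interesting, cur_other = build_fields(current, meta)
    if interesting != cur_interesting:
        return "expire_and_insert"     # Type 2 change: close the old row, open a new one
    if other != cur_other:
        return "update_in_place"       # Type 1 change: overwrite the latest row
    return "no_action"
[/code]

The returned action plays the role of the transformer constraints: each row flows down the insert, modify, or no-action link accordingly. The real jobs also produced delete statements where needed; the sketch leaves that out.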

Did that help?


Kind regards,
Mihai


_________________________
desk direct:+441908448571
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is coverage of handling SCDs in the "DataStage Best Practices" class offered by Ascential.


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518