Slowly Changing Dimension

Archive of postings to DataStageUsers@Oliver.com. This forum intended only as a reference and cannot be posted to.

Moderators: chulett, rschirm

Locked
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Slowly Changing Dimension

Post by admin »

Hello all,

Does any one know how to implement "Slowly changing Dimension" concept in DataStage ?

Thanks in Advance,
Mathi

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly
prohibited and may be unlawful.

Visit us at http://www.cognizant.com
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Begin by thinking through the logic. When loading the dimension table, does the row already exist? If no, generate key value and insert new row. If yes, is the row changed? If no, do nothing. If yes, generate key value and insert new row.

The reference lookup must be done on the original key value, not the surrogate key value. Therefore take a copy of only the auditable columns from the dimension table into a local hashed file (this will yield much better performance than a lookup on a secondary key). Use the hashed file to feed a Transformer stage via a reference input link, and constrain the output of the Transformer stage to "reference input key is null (not found) or row is changed (columns are different between stream input and reference input)".

Auditable columns are the ones you need to compare to see whether a difference has occurred.

There are many ways to generate a surrogate key value. Prefer to use a SEQUENCE/IDENTITY column if the target database allows. Otherwise, depending on your skill levels, use the SDK key management routines, or write your own routines: a before-stage subroutine to get the current maximum value of the surrogate key and load it into a variable in COMMON, and a transform function to increment this variable and return the new value.

Loading the fact table then involves loading a hashed file with just the true code value (as its key) and the maximum surrogate key value for that code value, and using this hashed file to convert the true code to the corresponding surrogate key value. The hashed file has only two columns.

-----Original Message-----
From: Venkatachalam, Mathialagan (CTS) [mailto:VMathial@chn.cognizant.com]
Sent: Wednesday, 07 March 2001 17:39
To: informix-datastage@oliver.com
Subject: Slowly Changing Dimension


Hello all,

Does any one know how to implement "Slowly changing Dimension" concept in DataStage ?

Thanks in Advance,
Mathi
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Hi Mathi,

Have you looked at the documentation for Changed Data Capture at the
following URL: http://www.informix.com/answers/english/pcdc4x.htm

Regards,
Anthony.

--- "Venkatachalam, Mathialagan (CTS)" wrote: > Hello all,
>
> Does any one know how to implement "Slowly changing Dimension" concept
> in DataStage ?
>
> Thanks in Advance,
> Mathi
>
> This e-mail and any files transmitted with it are for the sole use of
> the intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient, please contact the
> sender by reply e-mail and destroy all copies of the original message.
> Any unauthorised review, use, disclosure, dissemination, forwarding, printing
> or copying of this email or any action taken in reliance on this e-mail is
> strictly
> prohibited and may be unlawful.
>
> Visit us at http://www.cognizant.com
>


_____________________________________________________________________________
http://store.yahoo.com.au - Yahoo! Store
- The fastest, easiest way to open an online store.
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Tis not quite the same thing. Even though the data have changed in the operational system, it does not necessarily mandate a change in a dimension table. And CDC does not address the question of surrogate keys.


-----Original Message-----
From: Anthony Corrente [mailto:anthony_corrente@yahoo.com.au]
Sent: Wednesday, 07 March 2001 18:12
To: informix-datastage@oliver.com
Subject: Re: Slowly Changing Dimension


Hi Mathi,

Have you looked at the documentation for Changed Data Capture at the following URL: http://www.informix.com/answers/english/pcdc4x.htm

Regards,
Anthony.

--- "Venkatachalam, Mathialagan (CTS)" wrote: > Hello all,
>
> Does any one know how to implement "Slowly changing Dimension" concept
> in DataStage ?
>
> Thanks in Advance,
> Mathi
>
> This e-mail and any files transmitted with it are for the sole use of
> the intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient, please contact the
> sender by reply e-mail and destroy all copies of the original message.
> Any unauthorised review, use, disclosure, dissemination, forwarding,
printing
> or copying of this email or any action taken in reliance on this
> e-mail is strictly prohibited and may be unlawful.
>
> Visit us at http://www.cognizant.com
>


____________________________________________________________________________
_
http://store.yahoo.com.au - Yahoo! Store
- The fastest, easiest way to open an online store.
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Mathi

I have implemented type 2 slowly changing dimensions in the past. I used the checksum function within DataStage to determine if the fields that you are concerned with truly changed. If the checksum of the existing field equals the checksum of the the incoming field then obviously nothing has changed and you do nothing.

However, if those sums are different, then we added an end date to the existing record and created a new record with a new surrogate key and a new start date for the new entry.

If you need more details on how to further implement using the checksum function, please email me directly at rsscoggins@msn.com.

Thanks
Stacy Scoggins
--- "Venkatachalam, Mathialagan (CTS)"
wrote:
> Hello all,
>
> Does any one know how to implement "Slowly changing Dimension" concept
> in DataStage ?
>
> Thanks in Advance,
> Mathi
>
> This e-mail and any files transmitted with it are for the sole use of
> the intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient, please contact the
> sender by reply e-mail and destroy all copies of the original message.
> Any unauthorised review, use, disclosure, dissemination,
> forwarding, printing or copying of this email or any action taken
> in reliance on this e-mail is strictly
> prohibited and may be unlawful.
>
> Visit us at http://www.cognizant.com
>
>


__________________________________________________
Do You Yahoo!?
Get email at your own domain with Yahoo! Mail.
http://personal.mail.yahoo.com/
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Mathialagan,



DataStage does not have a wizard as such to create these mappings for you, however they are not hard to do. Now that you have labelled them, we use types 1 & 2 extensively in our warehouse. I would think that type three would present difficulties to most reporting tools, but, nonetheless, DataStage could still handle it.

Our warehouse is in Oracle, so Im generally speaking in the context of how DataStage interacts with Oracle.

Type 1

The output stage (in our case, an Oracle OCI plug in) can do either an "insert else update" or an "update else insert". The developer would choose based on whether more inserts or updates are expected. For SCD, I would expect that "update else insert" would generally be the preferred option and in practice we have found this to perform the best.

Type 2

DataStage does not build this automatically or with a wizard. However it is not hard to do and we use it extensively.

The logic partly depends on how good your changed data capture (CDC) is. One of our interfaces uses CDC based on triggers in the source system to record changes. However, not all changes to the source records are relevant to us, so it is not a forgone conclusion that a changed record presented to our warehouse is really changed from our perspective.

Obviously the dimension table would have a surrogate key as well as what I refer to as a natural key (ie the source system key plus time). DataStage could look up the dimension table to see if any of the attributes have really changed (this is optional, depending on the quality of the CDC). If a change is detected, update the end date of the old record and insert a new record. We use Oracle sequences to generate all our surrogate keys. DataStage can simply select this with a reference link or if you write a custom query for the insert, you can include it there (but why write code if you dont have to). Ray Wurlod wrote a good posting about alternative approaches to this some time back.

Type 3

The biggest issue here is your table design in the warehouse. DataStage can do whatever you need to populate columns of the target table in its transforms. These can be very simple or very complicated. We have an equipment hierarchy table in our warehouse that allows for up to 9 levels in the hierarchy. Every level is held on the single dimension record, so I guess it is a form of what you call Type 3. This involved a transform for each level. Each transform checked to see if the current record had a parent. If it did, it passed the hierarchy so far to the next transform, which does exactly the same thing. If it didnt, it wrote the completed hierarchy data to the target table.

A further interesting point is that the native database under DataStage is UniVerse which handles these multi-value constructs very nicely. However, I have little experience with UniVerse, so Ill leave this for others to explain.

I hope this helps a little. Im sure others will find plenty more to add to it.


David Barham
Information Technology Consultant
InformAtect Pty Ltd
Brisbane, Australia

e-mail: david@barham.hm


-----Original Message-----
From: Venkatachalam, Mathialagan (CTS) [mailto:VMathial@chn.cognizant.com]
Sent: Wednesday, 7 March 2001 20:56
To: David Barham
Subject: RE: Slowly Changing Dimension

David.

Below explained SCD in Informatica..


Regards
Mathialagan

SLOWLY CHANGING DIMENSIONS
Informatica can handle all three types of slowly changing dimensions (SCD). You were specifically asking about Type 1 & 2. Type 1 SCD is to have the data load into the target table. If the record already exists, update the record and if it does not exist to simply insert it. There are 2 ways Informatica handles this. The first way is the easiest by simply creating a mapping that loads into the target table. Then flag the load as a "update else insert" load. Informatica will load the data into the target table issuing update statements. If the update statement fails because the record is not already in the target then the record will automatically be inserted into the table. This method does not require any special mapping techniques to be achieved, but the potential issue is performance. If a record needs to be inserted the load will first try and update the table and then insert the record - hence two commands. If the majority of records are inserts the load could potentially run 30-40% slower than the second method. The second method requires the mapping to have the logic built into to it to see if the current record being propagated is already in the target table. If it is the record is flagged as an update else it will be flagged as an insert. This is a more efficient way to handle the Type 1 SCD, but it requires a little more complexity to the mapping. Type 2 SCD is to have the data to be loaded into the target table. If the record exists, an update to the existing record is required as well as inserting the record with the changed information along with a new surrogate key (leveraging Informaticas sequence generator). This is handled within Informatica similar to the second option within Type 1 SCDs. A mapping will need the logic to see if the current record already exists in the target. If it does then the mapping will first update the existing record and populate an end-date field with a timestamp and then insert a new record into the table that populates the beginning-date field with the timestamp, but leaves the end-date field empty or perhaps a default date such as 12/31/2200. If the record does not exist in the target then it will be inserted populating the beginning-date with the timestamp and the end-date being empty or containing the default date in the future. This entire process is done in one basic mapping. Note that Informatica has a slowly changing dimension wizard that steps through the process of creating the mappings for the above two types of SCDs as well as Type 3.


-----Original Message-----
From: Venkatachalam, Mathialagan (CTS)
Sent: Sunday, March 07, 1999 4:01 PM
To: David Barham
Subject: RE: Slowly Changing Dimension


Hi David,

A typical example for "Slowly changing Dimension (SCD)" is a PRODUCT dimension in which the detailed description of a given product is occasionally changed/adjusted. Because of change in detail description of proudct, we need to track the changes in warehouse.

There are three main techniques for handling SCD in datawarehouse.

OVERWRITING: (simple one)

A Type One change overwrites an existing dimensional attribute with new information. In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost. A Type One change updates only the attribute, doesnt insert new records, and affects no keys.

PRESERVING HISTORY:
A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data. Type Two changes let you preserve historical data. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes accurately partition history across time more effectively than other types. However, because Type Two changes add records, they can significantly increase the databases size.

Preserving a version of history (Type Three). You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately youre constrained by the maximum number of attributes allowed on a table.

Thata all about SCD.

Currenly Im working in both ETL tools Informatica and DataStage. In Informatica I can see separte Wizard to implement the SCD (all three techniques).

I would like to know how to implement the same in DataStage.

Hope this Helps!.

Thanks & Regards
Mathialagan


-----Original Message-----
From: David Barham [mailto:david@barham.hm]
Sent: Wednesday, March 07, 2001 3:32 AM
To: VMathial@chn.cognizant.com
Subject: RE: Slowly Changing Dimension


I am very familiar with data warehousing, OLAP tools and dimensions, but Im not quite sure what you are getting at. Can you please explain further?

Thanks,

David Barham
Information Technology Consultant
InformAtect Pty Ltd
Brisbane, Australia


-----Original Message-----
From: Venkatachalam, Mathialagan (CTS) [mailto:VMathial@chn.cognizant.com]
Sent: Wednesday, 7 March 2001 19:39
To: informix-datastage@oliver.com
Subject: Slowly Changing Dimension

Hello all,

Does any one know how to implement "Slowly changing Dimension" concept in DataStage ?

Thanks in Advance,
Mathi

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.

Visit us at http://www.cognizant.com


This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.

Visit us at http://www.cognizant.com
Locked