The ETL Staging Database (Sandbox) concept is not dead yet

Moderators: chulett, rschirm

Post Reply
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

The ETL Staging Database (Sandbox) concept is not dead yet

Post by kcbland »

Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Re: The ETL Staging Database (Sandbox) concept is not dead y

Post by datastage »

Everyone has their own terms, but when I have a database between the source and target systems that has some value to someone in terms of querying and reporting I call it an ODS. When I have a database just for the purposes of making ETL easier I refer to it as a 'staging area'. Ross and Kimball seem to agree stating "the data warehouse's staging area should be off-limits to the business users and reporting/delivery application developers". They also say this:

the data warehouse staging area is also highly concerned about data quality, integrity, and consistency. Incoming data is checked for reasonable quality as it enters the staging area. {text omitted here} Yes, that puts extra burden on the data staging team, but it's done is the spirit of delivering a better, more consistent product to the data warehouse patrons. "

The funny thing is as an ETL developer I always feel guilty when creating a staging area (ie, something that isn't of value to anyone but the ETL team). Not to mention although it delivers a higher quality solution it seems hard as a consultant to get the buy in on the extra time it takes to create a quality staging database. I can see the burning thought in their mind, "They built an extra database that isn't even used? what a waste!"

Any thoughts of how to convince of the need of a formal staging database or whether its better to ETL without one to maybe complete a project a month sooner?



shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Re: The ETL Staging Database (Sandbox) concept is not dead y

Post by shawn_ramsey »

We have one also and just by chance we also call it an ODS. :) We use it to normalize the dimensional data before loading it into the data warehouse.
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Just to be careful, an ODS and a sandbox are different things. You can accomplish similar goals, though.

The concept is that you have fully transformed, cleansed, verified, and approved ALL of your data across ALL of your target tables for a given load cycle prior to insert/update into your EDW. Data does not hover in mid-air when you shift from dimension processing to fact processing, and then aggregate processing.

Glad someone out there's reading Kimball too!
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Any thoughts of how to convince of the need of a formal staging database or whether its better to ETL without one to maybe complete a project a month sooner?
I think we have all had problems justifying or explaining ODS, Staging and OLAP databases and designs. Most of us have had issues explaining why they are designed wrong. Sometimes you have to let them do it wrong because they are paying the bills. After the fact I have seen them blame me for their poor designs even when I have memos explaining to them how wrong their decision or design was. Most of what we do is changed by perception. It should be more about what kind of results we got. You have to fit in to a certain degree or quit.

As more and more people understand what an ODS is versus a staging area then you will see better and better designs and solutions. What this topic points out is that data warehouse developers confuse ODS and staging areas or terms because of poor implementations. What I call "textbook wrong" implementations. How do you fight that battle against ignorant managers? Worse what happens when you do and get labeled as a trouble maker or hard to get along with. I am sure Ken has had these issues. Good consultants are not wimps. They are not easy to get along with. They have strong opionions because they have studied their craft. They know how to implement good solutions. They have paid their "dues" according to TJ.

I think it boils down to the difference between ignorance and stupid. Not knowing something is no sin but to stay that way is not professional. A professional does not stop studing because he gets his college degree whether it is a BS, MS or doctorate. A piece of paper does not make you good at something. If you cannot build a professional solution from start to finish then what good are you? The final product should be all that counts.

A staging area is a scratch pad. All the data is said to be in transition or temporary. To allow this data to be reported against means it is no longer temporary. You have broken your ETL by creating dependencies before you have delivered "clean" data. I have delivered solutions like this. Does not mean that I liked doing it or even that I approved of the design. Most of the time we do not have enough control or power to force a good or "textbook correct" solution. We have to live within the boundaries of the managers in charge.

A good example was the customer wanted a real time report. They wanted this report to look like the reports in the datamart. The source for the report came from the ODS and so was the source for the datamart. We created temporary tables which duplicated the datamart tables. We populated these tables with a stored procedure written in PLSQL. This procedure had about 6 cursors. It was a very ugly solution but they got their real time report. If it failed then it was my fault. If it was a poor design then it was my fault. How is that possible when I told them it was a bad idea? I was the only one with the talent to pull it off so it had to be my fault.

There will be a quiz so what is an ODS? How is it different than a staging area? Operational Data Store is "clean" operational data. A lot of times this data is used by web front ends because it is easier to build a web page on top of a relational database than on top of a mainframe. At some point these companies need to get rid of their mainframes and they know it. This data usually was created for reporting purposes. Why? Because Business Objects and MicroStrategy did not work against mainframe data. Companies wanted the power of these tools in the hands of their users. I am sure Ken or Hester can give you a much better textbook answer but this is my perception of how things evolved.

The original question was how do you get them to see the value in using a staging area. The answer is educate them. Not easy. We cannot seem to get the users on this forum to see the value in learning these terms much less in developing good data warehouses. You want to be a professional then read the books and understand the terms. Next learn how to implement good solutions using the ideas presented by Kimball and/or Inmon.

Based on your question you have done your homework. You are looking for better ways to implement "textbook correct". Most of the time the politics of a project is more of a problem than the development.
Mamu Kim
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

more on staging areas

Post by datastage »

Does the ETL staging area need to be formal, and by that I mean stored in a DBMS with a DBA assigned to it, with good naming conventions, good metadata, all ETL developers are on the same page as to its design, maintenance, and intended use? I know I've been places were developers typically worked on separate modules or areas, and did their own thing in terms of staging (whether it be hashed file, UV tables, flat files, or temporary tables in Oracle or similar). In the end every developer delivered excellent results so it didn't seem a problem that the staging wasn't formalized or standardized. Of course one developer was clueless at to what the other's code did until they could look at it for a while, but that is a sympton of of pushing the ETL team too fast in that there wasn't time to document or discuss each others work with the other team members on not as much of a product of the type of staging.

And similarly, what is the added cost of staging to a database management system that at least require some DBA work versus staging to hashed or flat files?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Getting buy-in for the staging area isn't too difficult give clients who are prepared to listen intelligently to rational argument. In essence there must be two (minimum) weapons in the consultant's armoury here; a few horror stories about where it went pear-shaped, and why (none of these examples has a staging area, of course), and a few success stories about how the application of best practice demonstrably led to success, speed to implementation (= lowered cost), and all those other good things.
Ultimately the staging area makes it easier to guarantee that the query databases (DWs, DMs, ODSs) contain "good" data - for any particular value of "good", which hinges upon business rules among other things.
If you have good data, you can make good decisions. If you don't, you can't. And the penalties for making bad business decisions can be dire indeed. Scare tactics? Absolutely!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Does the ETL staging area need to be formal, and by that I mean stored in a DBMS with a DBA assigned to it, with good naming conventions, good metadata, all ETL developers are on the same page as to its design, maintenance, and intended use?
I think this is a project not well managed. No documentation. Jobs not standardized. I think each type of job needs a template with naming conventions and well documented. If you are consistent then it is easier to support and modify in the future.

Break your jobs down into categories.

1. Source to sequential.
2. Sequential to hash.
3. Hash to dimension tables applying surrogate keys.
4. Sequential to fact tables applying surrogate keys.
5. Aggregate jobs.
6. Whatever.

Create a template for each of these. You may what folders for each of these. Folders based on subject area may also be necessary. Pick one of these to be a subfolder to the other.

In the above categories the staging area is either hash files or sequential files. Not much effort is needed to manage and standardize these areas. To manage metadata then you need to track which job writes to these files and which jobs reads from these files. MetaStage can easily manage this but the idea is when you change the metadata to one job then all the other jobs need to be modified.

Ken has some great ideas based on how to make your jobs restartable by using sequential files. Do a search. Some are in his white papers. Very solid methodology.

If you don't manage a project by understanding that your metadata needs to be accurately shared between jobs then you will pay the price by making mistakes in the future. Errors in the accuracy of your data is not easy to find especially when you do not have reports telling you which jobs have files in common. Metadata on tables is sort of managed by the database. If you miss a column then you just get nulls in that column. Reading from a table is not an issue. It will not break your ETL.

Staging area by its very nature can be left alone and your ETL may never break. The more complex a project then the more you need to manage all phases of it. In other words the more professional your solution then the more important the details are. If you are satisfied that you have "excellent results" then don't change it. From my experience the best developers have execellent work habits and try to do all phases well. They also understand the impact of not doing it well. There are times you need something done now. Do it fast always eliminates some professionalism. We have all had to do that at some point.
Mamu Kim
Post Reply