I'm going to briefly describe my architecture for a persistent staging database. Many data warehouses receive file feeds, so it's important to have a method for landing files and then inducting them into a queuing table for subsequent processing.
Every source file has a corresponding table. The DDL for all tables is the same:
Code:
CREATE TABLE STG_RTK_CHAIN
(
FILE_SET_DATE         DATE,            -- stamped at induction from the file set
UNIQUE_ROW_IDENTIFIER NUMBER(15),      -- surrogate row identifier
INDUCTION_BATCH_NBR   NUMBER(14),      -- batch that inducted the row
RAW_DATA_1            VARCHAR2(4000),  -- the packed raw record
RAW_DATA_2            VARCHAR2(4000),  -- spill-over for long records
FORMAT_NUMBER         NUMBER(3),       -- layout version (see format tables below)
BUSINESS_DATE         DATE,            -- business date of the data
ROW_STATUS_CODE       VARCHAR2(10),    -- 'I' = inducted, 'P' = processed
EDW_BATCH_NBR         NUMBER(14),      -- EDW batch that processed the row
ORIGINAL_DATA_1       VARCHAR2(4000),
ORIGINAL_DATA_2       VARCHAR2(4000)
);
Each file, after landing, is bulk loaded into its corresponding table. The raw data file, whether delimited or fixed width, is loaded into a single column (a spill-over column is available for long records). This means there's no normalization or mapping done; it's simply induction, and the content of the data doesn't matter. As part of this process, every row gets stamped with the file set date, a format number (more on this later), the business date of the data, and a status ("I" for inducted). A generic bulk loader script, using the same control file for all tables, performs the load. The script takes parameters for the source file and target table; otherwise, everything is identical. So whether there are 100 or 1,000 files and targets, this solution scales.
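As a sketch only: with Oracle's SQL*Loader, the shared control file might look something like the following, where the <...> tokens stand for values a wrapper script substitutes before each run. The placeholder names and the CONSTANT stamping are my assumptions, not the original script.

```
-- Generic control file sketch (SQL*Loader syntax); <...> tokens are
-- substituted by the wrapper script before invoking sqlldr.
LOAD DATA
APPEND
INTO TABLE <TARGET_TABLE>
TRAILING NULLCOLS
(
  RAW_DATA_1            POSITION(1:4000)     CHAR,  -- the whole raw record
  RAW_DATA_2            POSITION(4001:8000)  CHAR,  -- spill-over
  FILE_SET_DATE         CONSTANT '<FILE_SET_DATE>',
  INDUCTION_BATCH_NBR   CONSTANT '<BATCH_NBR>',
  FORMAT_NUMBER         CONSTANT '<FORMAT_NBR>',
  BUSINESS_DATE         CONSTANT '<BUSINESS_DATE>',
  ROW_STATUS_CODE       CONSTANT 'I',               -- inducted
  UNIQUE_ROW_IDENTIFIER SEQUENCE(MAX,1)             -- per-row surrogate
)
```

Because every staging table shares the same DDL, this one file serves every feed; only the substituted values change per run.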
Every file has a format definition stored in a pair of control tables that describe the layout of the file. Since files change format over time, a file can have multiple formats. Because the staging table doesn't enforce any data columns (it's just a packed string), you don't have any modeling changes when the file format shifts through expanded columns, additional columns, etc.
Code:
CREATE TABLE ETL_STG_FORMATS
(
TABLE_NAME            VARCHAR2(100),  -- staging table the format applies to
FORMAT_NUMBER         NUMBER(3),      -- layout version
BEGIN_EFFECTIVE_DATE  DATE,
END_EFFECTIVE_DATE    DATE,
COMMENTS              VARCHAR2(1000),
ROW_MODIFICATION_DATE DATE
);

CREATE TABLE ETL_STG_COLUMNS
(
TABLE_NAME            VARCHAR2(100),
FORMAT_NUMBER         NUMBER(3),
COLUMN_NAME           VARCHAR2(100),
COLUMN_POSITION       NUMBER(4),      -- ordinal position in the record
BYTE_POSITION         NUMBER(4),      -- starting byte within the packed string
BYTE_LENGTH           NUMBER(5),
COMMENTS              VARCHAR2(1000),
ROW_MODIFICATION_DATE DATE
);
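To make the control tables concrete, here is a hypothetical pair of formats for STG_RTK_CHAIN, where format 2 widens the CHAIN_NAME field. All dates, byte offsets, and the open-ended END_EFFECTIVE_DATE convention below are invented for illustration.

```sql
-- Two layout versions of the same file (invented values)
INSERT INTO ETL_STG_FORMATS
  (TABLE_NAME, FORMAT_NUMBER, BEGIN_EFFECTIVE_DATE, END_EFFECTIVE_DATE, COMMENTS)
VALUES ('STG_RTK_CHAIN', 1, DATE '2010-01-01', DATE '2011-12-31', 'Original layout');

INSERT INTO ETL_STG_FORMATS
  (TABLE_NAME, FORMAT_NUMBER, BEGIN_EFFECTIVE_DATE, END_EFFECTIVE_DATE, COMMENTS)
VALUES ('STG_RTK_CHAIN', 2, DATE '2012-01-01', NULL, 'CHAIN_NAME widened from 20 to 40 bytes');

-- Column layout for format 2; positions after the widened field shift
INSERT INTO ETL_STG_COLUMNS
  (TABLE_NAME, FORMAT_NUMBER, COLUMN_NAME, COLUMN_POSITION, BYTE_POSITION, BYTE_LENGTH)
VALUES ('STG_RTK_CHAIN', 2, 'CHAIN', 1, 1, 5);

INSERT INTO ETL_STG_COLUMNS
  (TABLE_NAME, FORMAT_NUMBER, COLUMN_NAME, COLUMN_POSITION, BYTE_POSITION, BYTE_LENGTH)
VALUES ('STG_RTK_CHAIN', 2, 'CHAIN_NAME', 2, 6, 40);
```

Nothing about the staging table itself changes when format 2 arrives; only these metadata rows are added.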
The last step is parsing the staging table for content. Since the layout of a file is stored in a control table, you use a view that parses the data out of the packed raw-data column. This is as simple as a view of SUBSTR calls designed to dynamically parse the data via the control-table layout.
Code:
CREATE OR REPLACE FORCE VIEW DATASTAGE.VU_STG_RTK_CHAIN
(FILE_SET_DATE, UNIQUE_ROW_IDENTIFIER, INDUCTION_BATCH_NBR, FORMAT_NUMBER,
 BUSINESS_DATE, ROW_STATUS_CODE, EDW_BATCH_NBR,
 CHAIN, CHAIN_NAME, MGR_NAME, CURRENCY_CODE)
AS
(SELECT FILE_SET_DATE,
        UNIQUE_ROW_IDENTIFIER,
        INDUCTION_BATCH_NBR,
        FORMAT_NUMBER,
        BUSINESS_DATE,
        ROW_STATUS_CODE,
        EDW_BATCH_NBR,
        -- each parsed column pulls its start byte and length from
        -- ETL_STG_COLUMNS for this row's own FORMAT_NUMBER
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN'))) "CHAIN",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN_NAME'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN_NAME'))) "CHAIN_NAME",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'MGR_NAME'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'MGR_NAME'))) "MGR_NAME",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CURRENCY_CODE'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CURRENCY_CODE'))) "CURRENCY_CODE"
   FROM STG_RTK_CHAIN);
So, whenever the ETL process that moves data from the staging database to the EDW has to select data, it uses this view to parse the data. If multiple days have queued up in the staging area, that's no problem: select everything with a status of "I", process it, then come back and update that selection set to "P" for processed and stamp it with the EDW batch number that handled it. Of course, this is just one example of how to manage the data.
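That pick-up/mark-off cycle might look something like this (the batch number is invented for illustration):

```sql
-- 1. Pull everything still queued, no matter how many days deep:
SELECT *
  FROM VU_STG_RTK_CHAIN
 WHERE ROW_STATUS_CODE = 'I';

-- 2. After the EDW load succeeds, mark off the same selection set:
UPDATE STG_RTK_CHAIN
   SET ROW_STATUS_CODE = 'P',
       EDW_BATCH_NBR   = 1234   -- the EDW batch that handled these rows
 WHERE ROW_STATUS_CODE = 'I';
```

Because the status and batch columns live on the staging table, the view stays a pure read-side artifact; updates go straight against the base table.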
If more columns are added to the file, then you must set up the new format, record when it became effective, and change any jobs that source the view to "see" the new columns. Of course, that data doesn't exist in legacy files, so the SQL in the view must handle the case where a column is not defined for a format and return NULL. But your processes don't break, and you don't need different versions of jobs for different file layouts at different points in time. That can be heartbreaking when historically reprocessing older files that switch format/layout a lot.
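With the scalar-subquery pattern above, that NULL behavior comes almost for free: if ETL_STG_COLUMNS has no row for a column under a given format, the subqueries return NULL, and in Oracle SUBSTR with a NULL position or length yields NULL. The REGION_CODE column below is a hypothetical late addition, shown only to illustrate the pattern:

```sql
-- Rows on older formats have no 'REGION_CODE' entry in ETL_STG_COLUMNS,
-- so the subqueries return NULL and SUBSTR(raw, NULL, NULL) is NULL.
SELECT TRIM(SUBSTR(RAW_DATA_1,
       (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
         WHERE TABLE_NAME = 'STG_RTK_CHAIN'
           AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
           AND ETL_STG_COLUMNS.COLUMN_NAME = 'REGION_CODE'),
       (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
         WHERE TABLE_NAME = 'STG_RTK_CHAIN'
           AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
           AND ETL_STG_COLUMNS.COLUMN_NAME = 'REGION_CODE'))) AS REGION_CODE
  FROM STG_RTK_CHAIN;
```

So adding the column to the view is safe against the entire history of the table; legacy-format rows simply come back NULL.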
My recommendation is an architect-level suggestion. If your team has left this type of work to the ETL developer, you have my sympathies. I am a data warehouse architect, so my team doesn't have to fight to put structures like this in place.