I have a question about loading more than 100 pipe-delimited text files, each with a different record format, into our Data Warehouse landing area. Each file will be loaded into a single Oracle table.
So my question is: what is the best way to load these files? One way I know is one job per file, each holding that file's layout and table layout, but that means a lot of jobs. Is there a better way to handle this than creating 100 jobs?
Is there a way to create some dynamic job that can read pipe-delimited files and load into a table passed as a parameter?
Any help is appreciated. :)
Thanks,
Sohail.
Loading a set of more than 100 Pipe delimited text files
Because the record layout is different, you're going to need 100 separate DataStage jobs.
Your only other possibility is to create one generic job that reads one "column" per row, with the world's most horrible parsing routine to handle the 100 different record layouts.
Using the KISS principle, do the 100 jobs. Since you're using 7.x, create a shell in a template job, then load the individual table definitions from the text files and from the Oracle tables into separate jobs created from that template. Which, of course, you've previously imported into your repository so that you can load them with just a click of the Load button. :D
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks for the reply. I didn't understand creating a shell in the template job and loading the table definitions from text files. Could you please elaborate on this a little more?

ray.wurlod wrote: Using the KISS principle, do the 100 jobs. Since you're using 7.x, create a shell in a template job, then load the individual table definitions from the text files and from the Oracle tables into separate jobs created from that template. Which, of course, you've previously imported into your repository so that you can load them with just a click of the Load button. :D
We already have all our Oracle table definitions from ERwin in DataStage via MetaStage/MetaBroker. The file layouts, I think, have to be created manually in each job if they differ from the final table layout?
Thanks,
Sohail.
Version 7 of DataStage lets you save a job as a template so you can use that template whenever you create new jobs. In your case your 100 or so jobs are all very similar so you can create a detailed template that includes all the job parameters, such as the file path and database login details, a sequential file input stage, a transformer and a database bulk load or OCI stage.
This makes for very fast creation of each job as you just have to go in and set the file name, table name and column definitions.
You have all the table definitions from ERwin, so you don't need to create the file definitions manually: you can import them into DataStage using the Import command from the DataStage Manager.
Can anyone out there comment on whether it is better to import the flat file definitions into MetaStage and then DataStage or go directly into DataStage first?
It is a lot easier if you have column names at the start of each file. If you have column names then the import step will give you the field definitions and the column names. If you don't have column names in the first row then you need to enter them manually which can be quite time consuming.
If you have all your file layouts documented then consider putting them into a word or Excel document and using search and replace functions to turn each file layout into a single row of pipe delimited headings. Paste each heading into the first column of each data file so the column names get loaded automatically on import.
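Vincent's search-and-replace tip can also be scripted. Here is a small sketch (the file names and column lists below are made-up examples, not from this thread) that prepends a pipe-delimited header row so the sequential file import can pick up column names automatically:

```python
# Sketch only: turn a documented column list into a pipe-delimited header
# row and write a copy of the data file with that header as line one.
# Paths and column names are hypothetical examples.

def prepend_header(data_path: str, columns: list[str], out_path: str) -> None:
    """Write out_path as the pipe-delimited header followed by data_path's rows."""
    header = "|".join(columns)
    with open(data_path, "r") as src, open(out_path, "w") as dst:
        dst.write(header + "\n")
        for line in src:
            dst.write(line)
```

Run once per file before the metadata import, then discard the headered copies.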
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
I'm going to briefly describe my architecture for a persistent staging database. As many data warehouses receive file feeds, it's important to have a method for landing files and then inducting them into a queuing table for subsequent processing.
Each file, after landing, is bulk loaded into its corresponding table. The raw data file, whether delimited or fixed width, is loaded into a single column (a spill-over column is available), which means no normalization or mapping is done; it's simply induction. As part of this process, every row is stamped with the file set date, a format number (more on that later), the business date of the data, and a status ("I" for inducted). Because this is pure induction, the content of the data doesn't matter. A generic bulk-loader script, using the same control file for all tables, performs the load. The script takes parameters for the source file and target table; otherwise it's all the same. So 100 or 1000 files and targets doesn't matter: this solution scales.
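The generic bulk-loader script isn't shown in the post, but one way to sketch it is a small generator that stamps out a SQL*Loader control file per source file/target table pair. The column names below follow the staging DDL later in this post; the specific POSITION/CONSTANT clauses are one plausible way to do it under those assumptions, not a tested production loader:

```python
# Sketch only (not Ken's actual script): build SQL*Loader control-file text
# that loads each raw line into RAW_DATA_1 and stamps induction constants.
# Column names match the STG_* staging DDL in this post; the exact clauses
# you need depend on your file layout and Oracle version.

CONTROL_TEMPLATE = """\
LOAD DATA
INFILE '{source_file}'
APPEND
INTO TABLE {table}
(
  RAW_DATA_1       POSITION(1:4000) CHAR,
  FILE_SET_DATE    CONSTANT '{file_set_date}',
  FORMAT_NUMBER    CONSTANT '{format_number}',
  ROW_STATUS_CODE  CONSTANT 'I'
)
"""

def make_control_file(source_file: str, table: str,
                      file_set_date: str, format_number: int) -> str:
    """Return control-file text for one source file/target table pair."""
    return CONTROL_TEMPLATE.format(source_file=source_file, table=table,
                                   file_set_date=file_set_date,
                                   format_number=format_number)
```

Because only the file and table names vary, the same generator covers 100 or 1000 targets.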
The last step is parsing the staging table for content. Since the layout of each file is stored in a control table, you use a view that parses the data out of the packed raw-data column. This is as simple as a view with substring expressions designed to dynamically parse the data according to the control-table layout.
So whenever the ETL process that moves data from the staging database to the EDW has to select data, it uses this view to parse it. If multiple days have queued up in the staging area, that's no problem: select everything with a status of "I", process it, then come back and update that selection set to "P" for processed and give it the EDW batch number that handled it. Of course, this is just one example of how to manage the data.
Every source file has a corresponding table. The DDL for all tables is the same:
Code:
CREATE TABLE STG_RTK_CHAIN
(
FILE_SET_DATE DATE,
UNIQUE_ROW_IDENTIFIER NUMBER(15),
INDUCTION_BATCH_NBR NUMBER(14),
RAW_DATA_1 VARCHAR2(4000),
RAW_DATA_2 VARCHAR2(4000),
FORMAT_NUMBER NUMBER(3),
BUSINESS_DATE DATE,
ROW_STATUS_CODE VARCHAR2(10),
EDW_BATCH_NBR NUMBER(14),
ORIGINAL_DATA_1 VARCHAR2(4000),
ORIGINAL_DATA_2 VARCHAR2(4000)
)
Every file has a format definition stored in another table that describes the layout of the file. Since files change format over time, you will have multiple formats for a file. Because the staging table doesn't enforce any data columns (it's a packed string), you don't have any modeling changes when the file format shifts to expanded columns, additional columns, etc.
Code:
CREATE TABLE ETL_STG_FORMATS
(
TABLE_NAME VARCHAR2(100),
FORMAT_NUMBER NUMBER(3),
BEGIN_EFFECTIVE_DATE DATE,
END_EFFECTIVE_DATE DATE,
COMMENTS VARCHAR2(1000),
ROW_MODIFICATION_DATE DATE
)
CREATE TABLE ETL_STG_COLUMNS
(
TABLE_NAME VARCHAR2(100),
FORMAT_NUMBER NUMBER(3),
COLUMN_NAME VARCHAR2(100),
COLUMN_POSITION NUMBER(4),
BYTE_POSITION NUMBER(4),
BYTE_LENGTH NUMBER(5),
COMMENTS VARCHAR2(1000),
ROW_MODIFICATION_DATE DATE
)
Code:
CREATE OR REPLACE FORCE VIEW DATASTAGE.VU_STG_RTK_CHAIN
   (FILE_SET_DATE, UNIQUE_ROW_IDENTIFIER, INDUCTION_BATCH_NBR, FORMAT_NUMBER,
    BUSINESS_DATE, ROW_STATUS_CODE, EDW_BATCH_NBR,
    CHAIN, CHAIN_NAME, MGR_NAME, CURRENCY_CODE)
AS
(SELECT FILE_SET_DATE,
        UNIQUE_ROW_IDENTIFIER,
        INDUCTION_BATCH_NBR,
        FORMAT_NUMBER,
        BUSINESS_DATE,
        ROW_STATUS_CODE,
        EDW_BATCH_NBR,
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN'))) "CHAIN",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN_NAME'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CHAIN_NAME'))) "CHAIN_NAME",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'MGR_NAME'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'MGR_NAME'))) "MGR_NAME",
        TRIM(SUBSTR(RAW_DATA_1,
             (SELECT BYTE_POSITION FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CURRENCY_CODE'),
             (SELECT BYTE_LENGTH FROM ETL_STG_COLUMNS
               WHERE TABLE_NAME = 'STG_RTK_CHAIN'
                 AND ETL_STG_COLUMNS.FORMAT_NUMBER = STG_RTK_CHAIN.FORMAT_NUMBER
                 AND ETL_STG_COLUMNS.COLUMN_NAME = 'CURRENCY_CODE'))) "CURRENCY_CODE"
   FROM STG_RTK_CHAIN);
If more columns are added to the file, then you must set up the new format, record when it became effective, and change any jobs that source the view to "see" the new columns. Of course, that data doesn't exist in legacy files, so the SQL in the view must handle the case where a column is not found for a given format and return NULL. But your processes don't break, and you don't have to keep different versions of jobs for different file layouts at different points in time. That can be heartbreaking when historically reprocessing older files that switch format/layout a lot.
My recommendation is an architect-level suggestion. If your team has left this type of thing to the ETL developer, you have my sympathies. I am a data warehouse architect, so my team doesn't have to fight to put structures like this in place.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Ken
Your solution is pretty cool, but yours has fixed-length fields and these files are pipe delimited. If it were me, I would write one job and export it to a DSX file, look at the relationship between the DSX file and the target table's metadata, and then write a program generator to build the DSX files. I think all you may need to do is move some columns around. This should save you a lot of time.
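The generator described above is essentially token substitution over an exported job definition. A rough sketch of that idea (the {{TOKEN}} placeholders are hypothetical; you would decide where to tokenize after inspecting your own exported DSX, since the DSX format itself isn't documented in this thread):

```python
# Sketch only: stamp out one job-definition file per file/table pair by
# substituting {{TOKEN}} placeholders in a hand-tokenized template that was
# exported from one working job. Token names here are invented examples.

def generate_dsx_files(template_text: str, targets: list[dict]) -> dict[str, str]:
    """Return {job_name: generated_text} for each target's metadata dict."""
    results = {}
    for target in targets:
        text = template_text
        for key, value in target.items():
            text = text.replace("{{" + key.upper() + "}}", str(value))
        results[target["job_name"]] = text
    return results
```

Feed it the file name, table name, and column metadata for each of the 100 targets and write each result out as its own DSX for import.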
If you are bulk loading, then Oracle or Toad can generate the control file for you. All you need is the control file; it will have the same issue in that the columns may be in the wrong order.
Mamu Kim
I used fixed length for this particular file, but the method is sound and you could also use field extractions. In this client's case, the database du jour is Oracle, which has no inherent SQL function for field extraction, just substring (SUBSTR), so we elected to go with a fixed-width format for the packed data. For other databases you can easily go with delimited; we did write a user function to deal with delimited files.
Anyway, the technique is cool! It is unbelievably scalable, and it is only a small piece of a larger induction system that also tracks file arrival and metrics. It's integrated with data quality inspection (the "original" columns are used to hold the original data, in case values were fixed during inspection).
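The delimited-file user function mentioned above isn't shown in the post, but the idea is an Nth-field extractor over the packed column; a plain-Python sketch of that behavior:

```python
# Sketch only: the Oracle user function itself is not in the post; this is
# the equivalent field-extraction logic for a delimited packed row.

def field(packed: str, delimiter: str, occurrence: int) -> str:
    """Return the Nth delimited field (1-based) from a packed row,
    or an empty string when the field doesn't exist."""
    parts = packed.split(delimiter)
    return parts[occurrence - 1] if 1 <= occurrence <= len(parts) else ""
```

Driven by the control table's column positions instead of byte offsets, this plays the same role SUBSTR plays in the fixed-width view.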
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle