A generic ETL job requirement

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsetlteam
Premium Member
Posts: 35
Joined: Mon Feb 10, 2014 10:14 pm
Location: USA

A generic ETL job requirement

Post by dsetlteam »

Hi Experts

I have a requirement wherein I have to create a generic job that reads from around 200 different delimited files and populates the data into their corresponding target tables (Netezza). Each file's metadata is different, and the delimiter can be pipe, tab, or comma. The other requirement is to capture any rejects into a reject table.

In order to achieve the above requirement, I created an RCP job that reads the data from the file and the metadata from a schema file. In the schema file, I have maintained the data type as "Varchar" for all the columns.
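For reference, the schema file looks roughly like this (the column names are just placeholders; the "Varchar" columns are written using the engine's string type):

record {delim='|', final_delim=end, quote=double}
(
  COL1: string[max=255];
  COL2: string[max=255];
  COL3: string[max=255];
)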
The issue I am facing is: how do I check each column's validity? For example, for Col1 I need to verify whether it holds a valid date value, for Col2 whether it holds a valid integer value, and so on. Keep in mind that I am not able to use a Transformer stage to put appropriate derivations on each column, because these columns are not visible in the job design; they are defined via the schema file and RCP.

A few options that I have tried are as follows.

1) Create a schema file with the exact data types and add a reject link on the Sequential File stage.
With this approach, invalid records are captured on the reject link; however, a warning is logged for each record in which a field contains invalid data. If a large number of records are processed and rejected, this approach produces a flood of warnings (see the sketch after these options).

2) Keep all the data types as varchar in the schema file and try to enforce the reject on the target stage.
The Netezza connector doesn't allow rejects to be captured on a reject link, so this option is ruled out.
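For illustration, the typed schema from option 1 would be along these lines (made-up columns; the date_format property is one way to pin down the expected format):

record {delim='|', final_delim=end, quote=double}
(
  CUST_ID: int32;
  SIGNUP_DATE: date {date_format='%yyyy-%mm-%dd'};
  FIRST_NAME: string[max=50];
)

Rows whose fields fail these conversions land on the reject link, but each failure also logs its own field-level warning.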

If you are aware of any way to code the above requirement, please let me know.

Thank You.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

RCP involves propagation (from input to output), nothing else.

If you want to validate, you need to name columns.

Maybe you can use generic column names. But generically validating is tricky. You need a rules engine of some kind, and a dispatch mechanism so that the correct rules are applied to each field.

This can be done, but doing it generically would be a nightmare.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Maybe you could implement validation once the data have arrived in Netezza (perhaps into staging tables)?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

You may not want to do this all in one job.
You might have, for example, a generic job (writing to a dataset) or a shared container (pass-through) that does the file reading, another job/container that does, say, address validation, another for dates, and so on, and you may have to string these together (reusing them where you can) differently for each source in the top-level sequence job. You can reuse a lot of stuff this way, but you do have to include a little source-specific logic.

The other way is to stage everything into similar tables (address, date, whatever) and then have one job that pulls from the generic tables and populates the targets from there. Both ways involve pros, cons, and a bit of work to get right.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

The Sequential File stage can do basic validation (there are certain settings you may need to enable to be stricter on data types, such as APT_IMPORT_ENFORCE_BOUNDED_LENGTH and APT_IMPORT_REJECT_INVALID_CHARS). Details:

http://www.ibm.com/support/knowledgecen ... nment.html

Please note: when you are creating schema files, you must use the Parallel Engine's data types (string, int32, et cetera), NOT the DataStage data types (varchar, integer, et cetera). Details here:

http://www.ibm.com/support/knowledgecen ... types.html
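For example (illustrative column names only; the // lines are comments), a comma-delimited schema using the engine types would be:

// VarChar(50) -> string[max=50]
// Integer     -> int32
// Date        -> date
record {delim=',', final_delim=end}
(
  FIRST_NAME: string[max=50];
  AGE: int32;
  BIRTH_DATE: date;
)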

-T.J.
dsetlteam
Premium Member
Posts: 35
Joined: Mon Feb 10, 2014 10:14 pm
Location: USA

Post by dsetlteam »

Hi Guys

Thanks so much for the pointers. We brainstormed within the team, thought through different ETL solutions, and finally came to the conclusion that we will push all the incoming data directly into the target tables, defining all the target fields as varchar. During consumption of the data, we will apply the data type validations and reject records if needed.

The ETL solution I built earlier (schema file and RCP) is now working; however, I still see warnings logged in the Director for each field. One example warning is as follows.

Netezza_Connector_34: Schema field FIRST_NAME in input data set 0 not present in design time information. Extended field properties for this field will not be available.

I am trying to figure out how to remove the above warning.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Just use a message handler if you do not want to see the message. This is particularly frustrating, since the Netezza Connector has some dependencies on the design-time OSH code having the details for those columns (take a peek at the OSH code sometime; there is a huge mess of XML data in there for Connectors).

-T.J.