Real time data processing using DataStage

Posted: Thu Jun 04, 2009 12:25 am
by abhinavsuri
Hi,

I have a project requirement to process information sent from the source system in real time. For example, when a new customer registers, the new customer's data is sent immediately to ETL. ETL should then trigger the job and process the file immediately. More than one file may arrive per minute.

One approach would be to use a shell script that checks for the arrival of a file. The script would run 24x7 and invoke an instance of the job as soon as a file arrives.
Is this approach advisable?
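
To make the polling idea concrete, here is a minimal sketch of such a watcher, written in Python rather than shell for readability. It is an illustration under assumptions, not a tested production script: the job parameter name `SourceFile`, the use of the file name as invocation id (which assumes a multi-instance job and file names that are valid invocation ids), and the `max_polls` argument are all hypothetical. The `dsjob -run` call itself is injected through `run`, so the exact command line can be adapted to the site.

```python
import subprocess
import time
from pathlib import Path


def find_new_files(landing_dir, seen):
    """Return files not yet in `seen`, oldest first, and mark them as seen."""
    candidates = sorted(
        (p for p in Path(landing_dir).iterdir()
         if p.is_file() and p.name not in seen),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    for p in candidates:
        seen.add(p.name)
    return candidates


def build_job_command(project, job, data_file):
    """Build one dsjob call per file (assumes a multi-instance job)."""
    invocation_id = Path(data_file).stem  # assumption: stem is a valid invocation id
    return [
        "dsjob", "-run",
        "-param", f"SourceFile={data_file}",  # hypothetical job parameter name
        project, f"{job}.{invocation_id}",
    ]


def watch(landing_dir, project, job, poll_seconds=5.0,
          run=subprocess.run, max_polls=None):
    """Poll the landing directory and start one job instance per new file."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for data_file in find_new_files(landing_dir, seen):
            run(build_job_command(project, job, data_file), check=False)
        polls += 1
        time.sleep(poll_seconds)
```

A real watcher would also need a way to tell that a file has been completely written before starting the job, for example a separate marker file from the source system, and would have to move or archive processed files so that a restart does not pick them up again.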

However, I have also read about some plugin stages such as the MQ Connector and Web Services stages.
Will these stages provide me with any additional functionality?
What are the advantages of these stages?
How exactly do these stages work?

Posted: Thu Jun 04, 2009 12:34 am
by abhinavsuri
Also, please let me know what else is required to use these stages. Do we need to install anything besides the additional stages?

Posted: Thu Jun 04, 2009 1:05 am
by ray.wurlod
Triggering the job is not real time. To get real time you need an always-running job that listens - maybe to an MQSeries queue, maybe because it's been exposed as a web service using WISD components.

What you propose is certainly feasible, but will incur job startup time, which may not be acceptable if the requirement truly is "real time".

Posted: Wed Jun 10, 2009 2:13 pm
by asorrell
You need to post more details about "real-time" and about the front-end that provides the data. Is real-time near-instantaneous? Or is within a few minutes good enough?

I've worked on systems where a web-based front-end created MQ messages that were picked up by an always-running DataStage job using MQ connectors. It then updated the database within seconds and sent confirmation back to the web application. This isn't trivial to set up - and has maintenance implications as well (i.e., what do you do if the database is down?).
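
As a rough illustration of that always-running pattern, and of the "database is down" question, the sketch below uses Python's in-memory `queue.Queue` as a stand-in for the MQ queue. In DataStage the listening would be done by the MQ connector inside the job, not by Python code, so this only shows the control flow. The names `write_row`, `send_ack`, `stop_when_idle` and the retry settings are hypothetical.

```python
import queue
import time


def listen(messages, write_row, send_ack, max_retries=3,
           retry_delay=1.0, idle_timeout=1.0, stop_when_idle=False):
    """Consume messages as they arrive; retry failed writes, park the rest."""
    failed = []  # messages that could not be written after all retries
    while True:
        try:
            msg = messages.get(timeout=idle_timeout)
        except queue.Empty:
            if stop_when_idle:  # only so the sketch can terminate
                return failed
            continue  # a real listener simply keeps waiting
        for attempt in range(1, max_retries + 1):
            try:
                write_row(msg)
            except ConnectionError:  # stand-in for "database is down"
                if attempt < max_retries:
                    time.sleep(retry_delay)
                continue
            send_ack(msg)  # confirm only after a successful write
            break
        else:
            failed.append(msg)  # no attempt succeeded
```

Because the loop never exits between messages, there is no job startup cost per message. The price is that someone has to decide what happens to parked messages: here they are only collected in a list, whereas a real system would more likely put them on an error queue or leave them on the source queue for a later retry.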

Posted: Thu Jun 11, 2009 1:38 am
by ArndW
I've done a couple of implementations that people called "real-time", but I much prefer to use "near real-time" or just to specify that the application needs to be synchronous. At present I am on an implementation using MQSeries for data transfer, and there are host systems involved, PC clients, as well as the UNIX servers for the transformation and repositories.

As mentioned before, if a DataStage job is constantly running and listening for data you will have near-real-time. If you have a job that gets started every 10 minutes and processes data then, for some, that is also near-real-time, while other sites would consider that 10-minute delay unacceptable.